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Abstract 

One  of  the  core  constructs  of  High  Performance  Fortran  (HPF)  is  the  array-slice  assignment  statement,  combined  with 
the  rich  choice  of  data  distribution  options  available  to  the  programmer.  On  a  private  memory  multicomputer,  the 
HPF  compiler  writer  faces  the  difficult  task  of  automatically  generating  the  necessary  communication  for  assignment 
statements  involving  arrays  with  arbitrary  block-cyclic  data  distributions.  In  this  paper  we  present  a  framework  for 
representing  array  slices  and  block-cyclic  distributions,  and  we  derive  efficient  algorithms  for  sending  and  receiving  the 
necessary  data  for  array-slice  assignment  statements.  The  algorithms  include  a  memory -efficient  method  of  managing 
the  layout  of  the  distributed  arrays  in  each  processor’s  local  memory.  We  also  provide  a  means  of  converting  the 
user’s  TEMPLATE,  ALIGN,  and  DISTRIBUTE  statements  into  a  convenient  array  ownership  descriptor.  In  addition, 
we  present  several  optimizations  for  common  distributions  and  easily-recognized  communication  patterns.  The  work 
presented  makes  minimal  assumptions  regarding  the  processor  architecture,  the  communication  architecture,  or  the 
underlying  language  being  compiled. 
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1  Introduction 


Private  memory  multiprocessors  have  made  it  possible  to  solve  larger  problems  more  quickly  compared 
to  single-processor  computers.  Problems  can  be  solved  more  quickly  by  distributing  the  work  among  the 
processors.  Problems  with  larger  memory  requirements  can  be  solved  due  to  the  fact  that  each  individual 
processor  can  typically  have  an  amount  of  private  memory  on  the  order  of  that  of  a  single-processor  system, 
so  the  total  amount  of  memory  in  the  system  scales  up  with  the  number  of  processors. 

When  mapping  a  scientific  application  onto  such  a  parallel  machine,  a  common  approach  is  to  distribute 
the  array  elements  among  the  individual  processors.  When  performing  a  piece  of  the  computation,  paral¬ 
lelism  is  obtained  when  related  data  are  found  on  the  same  processor,  and  the  absence  of  data  dependences 
allows  most  or  all  of  the  processors  to  compute  in  parallel.  Whenever  nonlocal  data  is  needed  for  a  com¬ 
putation,  communication  is  performed.  In  'his  approach,  all  processors  follow  essentially  the  same  thread 
of  control,  so  it  usually  suffices  to  write  a  single  source  program  which  is  executed  on  all  processors.  With 
a  careful  data  distribution,  the  computation  can  often  be  distributed  evenly  among  the  processors  while 
maximizing  the  data  locality. 

The  problem  with  this  approach  is  that  it  is  usually  time-consuming  and  error-prone  for  the  programmer, 
who  must  manage  the  memory  layout  and  the  communication  within  the  program.  The  difficulty  of 
implementing  this  management  correctly  can  discourage  the  programmer  from  experimenting  with  different 
distributions  of  data,  or  from  even  attempting  to  parallelize  the  program  in  the  first  place. 

An  obvious  solution  to  this  problem  is  to  have  a  compiler  which  automatically  manages  the  memory 
layout  and  the  communication.  This  approach  frees  the  programmer  from  these  concerns,  and  allows  the 
programmer  to  concentrate  on  the  more  important  aspects  of  the  algorithm  and  to  experiment  with  different 
methods  of  data  distribution. 

There  have  been  several  compiler  projects  with  this  goal.  The  AL  compiler  [17]  for  the  Warp  systolic 
array  allows  the  programmer  to  distribute  a  single  dimension  of  an  array  across  the  processors,  to  specify 
array  elements  that  are  used  in  the  same  computation,  and  to  distribute  computation  across  the  processors. 
In  addition,  AL  allows  the  specification  of  a  window  relation  which  allows  data  at  block  boundaries  to  be 
shared  by  several  processors.  Our  framework  allows  window  relations  in  the  form  of  left  and  right  overlap 
specified  in  the  ALIGN  statement  (see  Section  !  1).  CM  Fortran  [16]  allows  the  programmer  to  specify 
parallelism  in  the  form  of  Fortran  90  [1]  array  statements,  and  it  gives  the  programmer  some  freedom  in 
choosing  a  data  distribution.  Fortran  D  [9]  provides  a  rich  set  of  data  distribution  methods,  which  are  applied 
independently  to  each  dimension  of  an  array.  While  Fortran  D  uses  data  dependence  analysis  to  determine 
which  sequential  code  can  be  parallelized,  Fortran  90D  [6]  adds  the  syntax  of  Fortran  90  array  statements  to 
help  derive  parallelism.  Vienna  Fortran  [  1 9]  allows  the  programmer  to  specify  the  same  kinds  of  alignments 
and  distributions  as  in  Fortran  D,  and  in  fact  allows  many  more  kinds,  including  user-specified  alignment 
and  distribution  functions. 

In  this  paper,  we  derive  the  communication  and  the  memory  management  required  to  evaluate  array 
statements  similar  to  those  in  Fortran  90.  The  data  distribution  is  given  in  terms  similar  to  those  in  Fortran  D; 
we  assume  that  each  dimension  of  each  array  is  distributed  in  block-cyclic  fashion.  This  distribution  could 
be  specified  by  the  programmer,  or  the  compiler  could  automatically  derive  distributions,  using  methods 
similar  to  those  of  Chatteijee  etal.  [5]  or  Wholey  [18].  Parallelism  is  expressed  in  terms  of  array  statements, 
as  in  CM  Fortran  and  Fortran  90D.  Thus  our  approach  is  similar  to  that  taken  by  the  High  Performance 
Fortran  Forum  [8],  although  we  do  not  explicitly  assume  Fortran  as  our  input  language.  The  work  of 
automatically  deriving  parallelism  from  sequential  code  is  beyond  the  scope  of  this  paper. 

Chatterjee  et  al.  [4]  present  a  similar  framework  for  compiling  array  assignment  statements,  in  terms  of 
constructing  a  finite  state  machine.  Chatterjee’s  approach  accesses  data  in  a  manner  that  is  more  friendly 
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than  our  approach  to  a  data  cache,  especially  in  the  case  of  block-cyclic  data  distributions.  However,  our 
approach  requires  O(P)  less  buffer  space,  where  P  is  the  number  of  processors,  and  our  approach  allows 
more  overlapping  of  communication  and  computation.  Their  approach  additionally  allows  an  arbitrary 
integer  affine  alignment  function,  e.g.  ALIGN  A (ai  +  b)  WITH  T(t),  whereas  our  approach  only  allows 
a  to  be  0  or  1.  Finally,  although  Chatterjee’s  method  can  handle  the  communication  for  an  arbitrary  array 
assignment  statement,  such  as  the  communication  that  arises  during  a  redistribution,  it  induces  two  integer 
divisions  per  element  within  the  inner  loop,  although  the  extra  cost  can  be  made  more  reasonable  when  the 
divisors  are  known  to  be  powers  of  2. 

Although  other  work  [14,  13]  addresses  compile-time  analysis  of  array  statements  with  block  and 
cyclic  distributions,  our  approach  and  Chatteijee’s  approach  are  the  only  methods  to  date  which  address 
compile-time  analysis  of  block-cyclic  distributions. 

A  quite  different  approach  for  generating  communication  sets  is  taken  by  Saltz  et  al.  [15].  While 
the  methods  described  in  this  paper  use  compile-time  analysis  exclusively,  their  method  invokes  runtime 
analysis,  in  the  form  of  the  inspector/executor  model,  to  generate  the  communication.  Compile-time  analysis 
is  likely  to  produce  faster  code,  provided  that  sufficient  access  and  distribution  information  is  known  at 
compile  time,  although  it  does  not  extend  well  to  handle  the  important  class  of  unstructured  and  irregular 
data  distributions. 

The  work  presented  here  is  meant  to  be  sufficiently  detailed  to  allow  a  compiler  writer  to  use  it  as  a 
manual  for  implementing  a  similar  compiler.  We  have  validated  our  work  by  implementing  it  in  a  prototype 
compiler,  called  Fx,  for  the  iWarp  system  [2,  3],  The  Fx  compiler  takes  as  input  Fortran  77  statements 
augmented  with  whole-array  and  array-slice  syntax  and  directives  for  data  distribution,  and  produces  as 
output  Fortran  77  code  augmented  with  communication  primitives,  which  is  passed  on  to  the  native  iWarp 
Fortran  compiler. 

The  paper  is  organized  in  the  following  manner:  Section  2  gives  an  overview  of  the  approach  taken  in  the 
paper.  Section  3  presents  the  permanent  low-level  assumptions  we  make,  while  Section  4  imposes  temporary 
restrictions  on  the  form  of  statements  we  compile,  to  simplify  the  analysis.  These  restrictions  are  removed 
in  Section  10.  Section  5  describes  slice  intersection,  which  is  the  basis  of  the  methods  presented  in  Sections 
6  and  7,  which  describe  how  to  compute  the  data  that  is  sent  and  received,  respectively,  between  processors. 
Slice  intersection  is  also  vital  for  Section  8,  which  details  how  a  processor  performs  its  computation  after 
completing  its  communication.  The  communication  and  computation  phases  are  put  together  into  an  overall 
algorithm  in  Section  9.  Section  1 1  describes  how  processors  can  determine  data  ownership,  given  data 
distribution  directives  provided  by  the  user.  Section  12  presents  some  possible  optimizations  for  common 
cases.  Section  1 3  shows  how  the  basic  array  assignment  statement  can  be  easily  extended  to  handle  other 
useful  array  constructs. 

Note  that  Appendix  D  provides  a  partial  glossary  of  the  variable  naming  notation  used  throughout  the 
paper. 

2  Overview  of  the  approach 

In  our  approach,  parallelism  is  expressed  as  user  directives,  which  are  given  at  the  statement  level  in  the 
form  of  array  statements.  The  standard  example  of  an  array  statement  used  in  this  paper  is  the  assignment 
statement: 

A (Za  ■  l.A  ■  s.a)  =  P(% (/fi  :  Ib  '•  ■‘’s ))•  (1) 

which  is  equivalent  to  the  following  sequential  Fortran  code: 


*b  =  /b 

DO  —  fj\ ,  l  A  i  &  A 

T  (m)  =  B(ifl) 

1 B  =  ifl  +  $B 

END  DO 

DO  ij 4  =  /a'Iat&A 
A(m)  =  -F(T(m)) 
END  DO 


This  sequential  code  copies  the  relevant  section  of  B  into  a  temporary  array  T,  and  then  performs  the 
assignment  to  A  based  on  the  contents  of  T.  This  two-step  procedure  is  used  because  the  semantics  of  the 
assignment  statement  dictate  that  the  right-hand  side  terms  are  first  read,  and  then  the  results  are  computed 
and  the  left-hand  side  terms  are  written.  The  procedure  is  depicted  in  Figure  1 . 


Figure  1 :  The  two-step  procedure  for  evaluating  the  basic  array  assignment  statement. 

In  our  approach,  when  we  perform  the  translation  for  distributed  arrays  on  a  multiprocessor,  the  array 
T  will  be  aligned  with  A,  in  the  sense  that  corresponding  elements  of  A  and  T  will  always  be  found  on  the 
same  processor.  This  implies  that  the  relevant  section  of  array  B  is  temporarily  realigned  with  the  relevant 
section  of  A  to  compute  the  result. 

The  array  expression  A(/  :  /  :  s)  contains,  as  its  only  array  index,  an  array  slice.  This  notation  is 
identical  to  the  Fortran  90  subscript  triplet.  An  array  slice  contains  three  colon-separated  components,  and 
represents  an  arithmetic  sequence  of  integers  (i.e.,  the  difference  between  any  two  successive  elements  of 
the  sequence  is  constant).  This  sequence  is  given  by: 

/  +  5,  /  +  2 5, . .  .  ,  /  + 

This  expression  represents  the  sequence  of  integers  starting  at  /,  with  stride  s,  and  bounded  above  by  / 
(bounded  below  by  /  when  s  <  0).  Notice  that  /  is  not  necessarily  a  member  of  the  sequence,  as  in  the  slice 
(1  :  10:2),  which  contains  only  odd  integers. 

The  elements  of  arrays  A  and  B  are  distributed  across  the  processors  in  block-cyclic  fashion.  This 
method  of  distribution  is  pictured  in  Figure  2.  The  distribution  could  be  specified  by  the  user,  or  by  a 
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Figure  2:  An  example  of  a  block-cyclic  distribution.  The  shaded  elements  reside  on  one  particular  processor. 
The  processor  owns  array  elements  corresponding  to  indices  beginning  with  /,  in  blocks  of  size  n.  The 
distance  between  the  start  of  one  block  and  the  start  of  the  next  is  s. 

separate  compiler  phase;  in  any  case,  we  assume  that  the  distribution  has  already  been  determined  when  our 
analysis  is  applied.  For  each  processor,  there  is  a  set  of  indices  denoting  the  array  elements  that  reside  on 
the  processor.  Because  of  the  block-cyclic  distribution,  this  set  can  be  specified  using  only  four  parameters. 
We  say  that  the  processor  “owns”  the  array  elements  corresponding  to  this  index  set. 

We  assume  that  scalars  are  owned  by  all  processors.  Therefore,  every  processor  allocates  storage  for 
each  scalar,  and  at  any  given  point  in  the  program  execution,  all  processors  have  the  same  value  for  each 
scalar. 

For  the  work  presented  in  this  paper,  we  assume  that  parallelism  is  driven  by  the  owner-computes 
rule,  which  states  that  computation  is  performed  on  the  processor  on  which  the  result  is  to  be  stored. 
Note  that  while  the  owner-computes  rule  is  reasonably  simple  to  implement  and  likely  to  be  optimal  for 
many  statements,  it  is  not  the  only  model  on  which  to  base  compilation  algorithms.  On  a  private  memory 
architecture,  use  of  the  owner-computes  rule  implies  that  the  compiler  must  generate  code  to  perform  three 
steps  for  each  processor  P: 

Send:  For  every  processor  Q,  determine  which  array  elements  are  owned  by  P  and  needed  for  Q’s 
computation,  and  send  these  elements  from  P  to  Q. 

Receive:  For  every  processor  Q,  determine  which  array  elements  are  owned  by  Q  and  needed  for  P’ s 
computation,  and  receive  these  elements  on  P  from  Q. 

Compute:  For  every  element  on  the  left-hand  side  of  the  equation  owned  by  P,  perform  the  computation 
and  store  the  result  locally.  After  having  executed  the  previous  two  steps,  all  right-hand  side  data  is 
guaranteed  to  be  in  local  memory  on  P. 

The  difficult  parts  of  these  steps  are: 

•  determining  which  array  elements  to  send,  and  where  in  local  memory  these  array  elements  can  be 
found; 

•  determining  where  in  local  memory  to  store  received  elements; 

•  determining  which  elements  to  nerform  the  computation  on  after  receiving  from  all  other  processors. 

Our  approach  is  to  devise  a  general  method  to  handle  all  cases,  and  then  to  identify  and  optimize  the  common 
cases.  The  basis  of  our  general  method  is  to  find  intersections  of  slices,  which  can  be  used  both  in  array 
expressions  and  to  characterize  block-cyclic  ownership  sets.  Since  a  slice  can  be  characterized  by  three 
parameters,  and  one  additional  parameter  can  be  used  to  characterize  an  ownership  set  in  terms  of  slices, 
the  resulting  intersections  can  also  be  specified  using  a  fairly  small  number  of  parameters. 
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When  mapping  a  single  source  program  to  multiple  processors,  a  compiler  can  take  the  approach  of 
either  creating  a  single  program  to  be  run  by  all  processors,  or  creating  several  different  programs  that  are 
mapped  onto  the  set  of  processors.  In  our  model,  parallelism  is  derived  only  at  the  statement  (data-parallel) 
level,  and  not  at  the  instruction-parallel  level;  that  is,  every  processor  is  assumed  to  be  following  the  same 
thread  of  control.  For  this  reason,  all  processors  execute  similar  instruction  streams.  Hence  the  analysis 
presented  in  this  paper  is  designed  to  allow  a  compiler  to  produce  a  single  program  as  output,  which  is 
meant  to  be  executed  by  all  processors,  in  a  Single  Program  Multiple  Data  (SPMD)  model. 

For  the  remainder  of  the  paper,  we  proceed  using  a  bottom-up  approach.  We  begin  by  describing  how 
to  compile  the  simple  array  assignment  statement  given  in  equation  (1).  This  example  captures  the  essence 
of  the  complexity  of  compiling  array  assignment  statements.  Then  we  extend  the  framework  to  handle 
arbitrary  statements,  and  we  describe  possible  optimizations.  In  describing  the  compilation  of  equation  (1 ), 
we  reduce  the  communication  requirements  to  that  of  a  single  send  (and  the  corresponding  receive)  between 
an  arbitrary  pair  of  processors.  Section  6  describes  how  the  sending  processor  determines  which  data  to 
send,  while  Section  7  describes  how  the  receiver  processes  and  “decodes”  the  received  data. 

3  Basic  requirements 

There  are  certain  basic  requirements  that  we  impose  on  the  form  of  the  program  being  compiled,  and  on  the 
machine  for  which  the  program  is  compiled.  This  section  describes  and  motivates  these  requirements. 

We  assume  the  existence  of  a  general  message  passing  system.  This  message  passing  system  should 
be  capable  of  supporting  messages  sent  from  one  processor  to  itself,  since  the  algorithms  presented  here 
make  no  distinction  between  sends  to  oneself  and  sends  to  a  different  processor.  Messages  received  within 
one  phase  are  allowed  to  arrive  in  any  order,  they  can  be  processed  as  soon  as  they  arrive.  However,  to 
process  an  incoming  message  correctly,  the  receiving  processor  must  have  a  means  of  determining  which 
processor  sent  the  message.  If  the  message  passing  system  does  not  provide  this  functionality,  the  compiler 
must  generate  code  to  provide  it  (e.g.,  by  having  the  sender  provide  its  identifier  along  with  the  message). 

All  distributions  of  arrays  must  be  block-cyclic  (note  that  block  and  cyclic  distributions  are  special  cases 
of  block-cyclic  distributions).  The  primary  reason  for  this  choice  is  that  the  block-cyclic  method  prov  ides 
a  large  number  of  distributions,  while  requiring  a  small,  fixed  number  of  parameters  to  describe.  Another 
reason,  as  shown  later,  is  that  block-cyclic  distributions  can  be  described  as  unions  of  slices,  as  in  equations 
(3)  and  (4),  and  slices  are  closed  under  intersection.  Thus  all  sets  over  which  we  take  intersections  are 
described  in  terms  of  slices,  as  are  the  intersections. 

For  a  particular  array,  the  block  size  must  be  the  same  for  each  processor  over  which  the  array  is 
distributed.  In  a  heterogeneous  multiprocessing  environment,  it  might  sometimes  be  worthwhile  to  assign 
different  proportions  of  the  array  to  different  processors  to  account  for  differences  in  processing  speeds. 
However,  that  kind  of  application  is  beyond  the  scope  of  this  paper.  In  addition,  if  block  sizes  are  allowed 
to  be  different,  certain  optimizations,  such  as  the  one  presented  in  Section  1 2. 1 . 1 .  cannot  be  performed,  and 
certain  computations  must  move  inside  an  additional  loop. 

We  use  the  owner-computes  rule  to  divide  the  computation  among  the  processors  and  to  determine 
the  communication.  This  assumption  forms  the  basis  of  all  the  communication  and  computation  analysis 
presented  here. 

Although  array  ownership  is  usually  disjoint  (i.e.,  each  array  element  is  owned  by  only  one  processor), 
all  scalars  are  replicated.  We  wish  the  program  execution  to  be  deterministic  and  to  follow  the  semantics 
of  an  equivalent  sequential  program.  Thus  we  can  think  of  the  parallel  execution  as  having  a  single  thread 
of  control.  At  any  point  in  this  thread  of  control,  all  processors  have  the  same  local  value  of  a  replicated 
variable.  This  restriction  is  necessary  to  preserve  the  sequential  semantics  of  the  program. 
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The  final  requirement  involves  the  modulo,  or  “mod”,  operator.  Throughout  this  paper  we  will  use 
expressions  in  the  form  a  mod  6,  where  6  will  always  be  positive.  However,  a  will  sometimes  be  negative 
in  this  context.  We  use  the  mathematical  definition  of  the  modulo  operator,  in  which  a  mod  b  is  defined  to 
return  a  value  between  0  and  6—1,  inclusive.  Note  that  in  the  C  programming  language,  the  sign  of  a9cb  is 
implementation-dependent  when  either  a  or  6  is  negative  [11]. 

4  Simplifying  restrictions 

As  our  typical  array  statement,  we  have  chosen  the  assignment  statement  given  in  equation  ( 1 ).  However, 
this  statement  is  far  from  typical  in  most  programs.  For  example,  assignment  statements  could  involve 
multiple  terms  on  the  right-hand  side,  arrays  with  more  than  one  dimension,  and  scalar  subscripts  as  well  as 
slices. 

For  now,  we  will  make  a  number  of  additional  assumptions  concerning  the  form  of  the  array  statements, 
all  of  which  will  be  relaxed  in  Section  10.  The  assumptions  are  the  following: 

•  We  are  considering  only  array  assignment  statements.  Most  other  array  constructs  are  simple  exten¬ 
sions  of  the  array  assignment  statement.  The  implementation  of  more  array  constructs  is  described  in 
Section  13. 

•  Each  array  has  only  one  dimension.  For  multidimensional  array  assignments,  the  analysis  extends 
orthogonally  for  each  dimension  of  the  array,  as  we  will  show  later. 

•  In  each  assignment  statement,  there  is  only  one  term  on  the  right-hand  side.  When  this  restriction  is 
relaxed,  the  analysis  also  extends  orthogonally  for  each  additional  term  on  the  right-hand  side. 

•  For  every  slice,  the  stride  component  (.s)  is  positive.  One  reason  for  this  restriction  involves  Euclid's 
algorithm  for  computing  the  greatest  common  divisor  of  two  integers,  which  requires  nonnegative 
inputs.  A  more  important  reason  is  that  the  algorithms  that  will  be  presented  assume  that  /  is  a  lower 
bound  on  the  sequence  and  /  is  an  upper  bound.  This  condition  does  not  hold  when  the  stride  is 
negative,  as  in  the  slice  ( 10  :  1  :  - 1 ),  where  /  is  an  upper  bound  and  /  is  a  lower  bound. 

•  Nested  array  references  are  not  allowed.  An  example  of  a  nested  array  reference  is  A(B(  1))  or 
A(B(  1  :  n)),  where  A  and  B  are  arrays.  This  paper  will  not  consider  ways  to  efficiently  evaluate  such 
nested  array  references,  since  in  this  case  there  is  no  longer  a  way,  in  general,  to  characterize  the 
sequence  of  array  elements  using  a  small  set  of  parameters. 


5  Slice  intersection 

When  deriving  the  algorithms  below,  it  will  be  necessary  to  find  intersections  of  slices.  A  slice  parameterizes 
an  arithmetic  sequence  of  integers,  and  we  need  to  find  the  intersection  of  two  or  more  such  slices,  while 
preserving  the  ordering.  As  we  show  below,  such  an  intersection  either  is  empty  or  is  another  slice;  thus  slices 
are  closed  under  intersection.  Our  goal  is  to  find  the  first ,  last,  and  stride  components  of  the  intersection. 
Figure  3  shows  two  examples  of  slice  intersections. 

Our  approach  for  finding  the  intersection  of  two  slices  has  three  conceptual  steps.  The  first  step  is 
to  extend  the  lower  bound  and  the  upper  bound  of  each  sequence  to  -oc  and  +oc,  respectively,  while 
remembering  the  original  bounds.  The  second  step  is  to  find  the  intersection  of  these  infinite  sequences, 
which,  as  shown  below,  will  be  either  empty  or  another  infinite  sequence.  The  final  step  is  to  find  the 
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(1:48:4)  ■MBMMllBllMlIB.llBMB  I'Tl  I  1  ■  1  1  ■  1  I  ■  TT1 

(3:48:6)  I  1  ■  I  I  1  I  ■  1T1  I  ■  1  M  T  Ml  I  I  I  ■  11  I  I  ■  I  I  I  I  M3  1  1  I  TT1 

(9:48:12)  I  II  I  I  I  I  EM  M  I  I  M  II  M  ■  I  I  11  I  I  1  I  11  ■  I  I  1  M  II  I  ll  TUI 


(1:48:4)  ■  I  I  ■  I  UUH . EJEU3  I  ■  I  1  ~MT~nB'TTWl~T''»IEI  WIT"! 

(4:48:6)  I  I  I  ■  I  I  II  III  I  E1ECT  I  I  I  I  ■  II  IT  ■  I  TH"  1TT1  TTI 

[empty]  IT  I  I'lTTT  I  HD  EH  I  I  I  I  I  !  I  I  II  I  I  ITTTT  I  I  I  I  II  I  M  I  I  I  I  I  I  I 

Figure  3:  Two  examples  of  slice  intersection.  In  the  first  example,  we  see  that  the  intersection  ( l  :  48  : 
4)  n  (3  :  48  :  6)  =  (9  :  48  :  12).  In  the  second  example,  the  intersection  (1  :  48  :  4)  n  (4  •  48  :  6)  is  empty: 
the  slices  cannot  possibly  share  elements,  since  the  first  slice  contains  only  odd  integers,  while  the  second 
slice  contains  only  even  integers. 

true  lower  and  upper  bounds  of  the  interse*  lion  by  using  the  remembered  original  bounds.  This  approach 
works  well  for  finding  the  intersections  of  three  or  more  slices,  since  we  can  avoid  the  steps  of  converting 
intermediate  intersections  from  finite  sequences  to  infinite  sequences,  and  vice  versa.  Figure  4  depicts  the 
three  steps  of  this  approach. 

Consider  the  slice  (/  :  /  :  s).  Recall  that  /  represents  the  first  (smallest)  element  of  the  slice,  /  is  an 
upper  bound  on  elements  in  the  set,  and  s  is  the  stride.  Assume  that  *  is  positive;  if*  is  negative,  then  /  is 
the  largest  element  and  /  is  a  lower  bound. 

We  introduce  a  modified  representation  of  a  slice,  (/  :  l  :  $  :  r),  which  represents  a  slice  with  infinite 
bounds,  as  described  above,  where  /  and  /  are  the  original  remembered  bounds.  The  *  parameter  serves  the 
same  purpose  as  before.  The  r  parameter  is  a  “representative"  of  the  set,  in  the  sense  that  any  member  of 
the  set  is  equal  to  t  -ns  for  some  integer  n.  For  example,  (0  :  10  :  2  :  37)  represents  the  same  sequence 
as  ( 1  :  10  :  2),  which  is  also  the  same  as  ( 1  :  9  :  2).  In  this  representation,  the  /  parameter  need  not  be  a 
representative  of  the  set,  in  the  same  sense  that  the  l  parameter  need  not  be  a  representative  of  the  set.  The 
only  requirement  is  that  for  any  integer  n,  r  +  n.s  is  a  member  of  the  original  set  if  and  only  if  /  <  r  +  ns  <  I 
in  the  new  representation. 

The  conversion  from  one  representation  to  the  other  is  simple.  The  sequence  given  by  i  /  :/:.*)  is  the 
same  as  that  given  by  (/:/:*:  /),  and  the  sequence  given  by  (/:/:*:  r )  is  the  same  as  that  given  by 
(/  +  (('•-/)  mod  s)  :  l  :  .?). 

Consider  the  intersection  of  the  slices  S,  =(/,:/,:  *,  :  r. )  and  Sj  =  (f.  :  /  :  :  r,).  A  representative 

of  the  intersection  is  some  integer  r  for  which  there  exist  integers  m  and  n  such  that  r  =  r,  4-  ms,  =  r.  +  ns,. 
This  equation  tells  us  that: 

m. s,  =  Vj  —  r,  ( mod  *, ). 

Let  x,  y,  and  g  be  integers  such  that  xs,  4-  ysj  =  g  -  gcd( ).  These  integers  can  be  computed  using 
Euclid’s  extended  GCD  algorithm  [12,  Volume  2],  A  theorem  of  modular  linear  equations  |7,  Chapter  33] 
tells  us  that  a  solution  exists  if  and  only  if  g\(  r,  -  r, )  [that  is,  g  evenly  divides  ( r.  -  r,  )|,  and  that  a  value 
for  m  is: 

( r,  -  r,)x 
w  -  — - . 

.7 
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Figure  4:  An  illustration  of  the  three  conceptual  steps  involved  in  computing  the  intersection  of  two  slices, 
S\  =  (/i  :  /|  :  si )  and  S2  =  ( fz :  h  '■  «i).  In  step  (a),  we  extend  the  lower  and  upper  bounds  to  -x  and  +  x, 
respectively,  yielding  5J  and  S"2.  In  step  (b),  we  compute  =  S\  n  S'2.  In  step  (c),  we  compute  the  finite 
bounds  /3  and  /3. 

By  substituting: 

(r  -  r,)x.s, 

r  =  r,  4-  m.s',  =  r,  +  — 1 - . 

<7 

It  is  clear  that  the  resulting  stride  of  the  intersection  will  be  lcm(.s,.  ,s7 )  =  As  our  new  upper  and 

lower  bounds,  we  must  take  the  tightest  of  the  two  original  bounds.  Therefore: 

(/,  :  l,  :  Si  :  r,)  C\(fj  :  lj  :  s7  :  r,)  =  (maxf/,./,)  :  min(/,-./j)  :  s,Sj/g  :  r,  +  (r,-  -  rt)xs,/<j).  (2) 

Note  that  the  representative  is  not  necessarily  the  same  as  either  of  the  lower  bounds.  This  observation 
provides  the  motivation  for  using  this  modified  slice  representation.  If  we  are  taking  the  intersections 
of  three  or  more  slices,  as  we  will  in  later  sections,  it  will  be  more  efficient  to  minimize  the  number  of 
conversions  between  representations. 
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6  Sends 


This  section  describes  how  a  processor  determines  what  data  to  send  to  another  processor.  The  data  is 
described  in  terms  of  which  elements  of  the  right-hand  side  array  are  needed  oy  the  receiving  processor 
and  can  be  supplied  by  the  sending  processor.  We  also  describe  how  the  sending  processor  maps  the  global 
array  indices  to  local  memory. 

6.1  Derivation 

Suppose  that  we  have  an  assignment  statement  in  the  form  of  equation  (1).  We  want  to  determine  which 
elements  of  B  should  be  sent  from  processor  S  to  processor  V.  First,  we  need  a  means  of  describing  the 
set  of  array  elements  that  a  particular  processor  owns.  Four  parameters  suffice:  /,  /,  s,  and  n.  The  / 
parameter  is  the  lowest-numbered  index  of  the  array  elements  owned  by  the  processor;  the  l  parameter  is  an 
upper  bound  on  the  highest-numbered  index  of  the  array  elements  owned;  the  n  parameter  is  the  block  size; 
and  the  s  parameter  is  the  offset  between  the  start  of  one  block  and  the  start  of  the  next.  With  these  four 
parameters,  a  block-cyclic  distribution  can  be  described  as  the  disjoint  union  of  slices  Ut=o (f  +  k\l:s), 
as  pictured  in  Figure  5. 

( f+0:l:s ) 

(/+  1 :  j) 

C/+2:/:i) 

(f+i:l:s ) 

n  -  I 

U  ( f+k:l:s ) 

4*0 


Figure  5:  A  block-cyclic  distribution  as  a  union  of  disjoint  slices. 

Since  A  and  B  have  block-cyclic  distributions,  we  are  given  integers  fp,  Ip,  np,  sp ,  fs.  Is,  ns.  and 
such  that  the  portion  of  A  that  processor  V  owns  is  characterized  by: 

nv  ~  1 

Ovonp(k)  -  [J  (fp  +  j' '■  Ip '■  $p)  (3) 

j'= o 

and  the  portion  of  B  that  processor  S  owns  is  characterized  by: 

«5-l 

Owns(B)  =  (J  (fs  +  i' :  Is  '■  Ss)-  (4) 

i'=0 

Note  that  the  names  of  processors  S  and  V  do  not  appear  in  the  following  analysis;  they  are  implicitly 
encoded  in  the  parameters  fp  and  fs-  This  encoding  is  described  in  Section  1 1.  Also  note  that  each 
ownership  set  is  the  union  of  disjoint  sets,  provided  that  np  <  sp  and  tis  <  *s- 

The  basic  method  for  determining  the  communication  sets  is  to  compute  unions  and  intersections  of 
regular  sets  (i.e.,  slices),  which  can  be  characterized  with  3  parameters.  Although  slices  are  closed  under 
intersection,  they  are  not  closed  under  union;  hence  unions  of  slices,  and  intersections  of  unions  of  slices, 
require  more  than  3  parameters  to  represent,  and  require  more  complexity  than  simple  slices  to  enumerate. 
Figure  6  gives  an  example  of  the  communication  that  might  arise  between  processors  during  the  evaluation 
of  the  assignment  statement  A(  1  :  n  :  1 )  =  B(  1  :  n  :  1 ).  Each  array  is  distributed  over  two  processors, 
where  A  has  a  block  size  of  3  and  B  has  a  block  size  of  5.  The  shaded  boxes  represent  the  elements  of  each 
array  that  are  owned  by  the  corresponding  processor.  The  arrows  represent  the  array  elements  that  processor 
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s  must  send  and  that  processors  d\  and  d2  must  receive;  hence  the  arrows  correspond  to  the  intersection  of 
the  block-cyclic  ownership  sets.  Each  set  of  arrows  represents  the  intersection  of  the  corresponding  unions 
of  slices;  note  that  it  is  complex  to  represent  the  intersection,  even  though  the  ownership  sets  have  simple 
representations  as  unions  of  slices.  The  complexity  increases  when  we  use  non-unit  stride  components  in  the 
assignment  statement.  Note  also  in  this  example  that  there  is  little  similarity  between  the  two  intersections 
(one  intersection  is  represented  by  upward  arrows,  the  other  by  downward  arrows). 


Figure  6:  An  example  of  the  potential  complexity  of  transferring  data  from  one  processor  s. 

The  set  of  indices  corresponding  to  the  array  elements  of  A  owned  by  processor  D  that  are  used  in  the 
computation  is  given  by  S\  =  Own-pi  A)  D  if  a  :  l  a  :  s.4).  The  set  of  indices  corresponding  to  the  array 
elements  ofB  owned  by  processor  5  that  are  used  in  the  computation  is  given  by  52  =  Owns(B)C\(fB  :  Ib  ■ 
sb)-  Processor  S  sends  the  array  element  corresponding  to  the  ith  member  of  the  sequence  ( fs  :  Ib  ■  *b) 
if  and  only  if  the  ith  member  of  the  sequence  ( /a  :  1  b  :  sb  )  is  in  S2  and  the  /th  member  of  the  sequence 
if  A  •'  l  A  '  >s.4 )  is  in  5|.  Suppose  index  x  is  member  i  of  the  sequence  (/a  :  Ib  '■  $B ),  where  i  =  0  represents 
the  first  member  of  the  sequence.  Then,  i  =  Now  member  i  of  the  sequence  if  a  :  I  a  :  *4  )  is 

fA  +  is  a,  which  is  equal  to  ~^-sa  +  f a-  Thus  processor  S  sends  the  array  element  corresponding  to 
index  x  if  and  only  if  a:  €  S2  and  ~{^-sa  +  /.4  €  S\.  Equivalently,  processor  5  sends  the  array  element 

corresponding  to  index  +  /s  if  and  only  if  y  €  5i  and  +  /a  €  S2- 

The  above  observation  motivates  the  definition  of  a  Map  function,  which  maps  a  member  of  the 
sequence  if  a  :  l  a  '■  -s.4 )  to  the  corresponding  member  of  the  sequence  (/a  :  Ib  ■  *b)-  h  >s  defined  on  an 
index  y  as: 

Mapiy)  =  - — —  ■  sB  +  /a-  ( 5 ) 

•5.4 

and  the  Map  function  is  defined  on  a  set  or  sequence  in  the  obvious  fashion.  The  purpose  of  this  Map 
function  is  to  associate  A  if  a)  with  B(/a)  =  B(A/ap(/.4)),  to  associate  A  (fA  +  *4)  with  B(/a  +  -<a )  = 
B(  Mapi  fA  +  5.4 ) ),  to  associate  A(  fA  +  2s  .4 )  with  B{  /a  +  2s  b  )  =  B(  Map(  f,  4  4-  2s  4 ) ),  etc.  Then  processor 
S  sends  the  array  element  corresponding  to  index  Mapiy)  if  and  only  if  y  €  Si  and  Mapiy)  e  S2.  Thus 
the  set  of  elements  of  B  that  processor  «S  sends  to  processor  V  corresponds  to  the  following  set  of  indices: 

$  =  52  H  MapiS  1)  =  iOivnsi  B)  n  (/a  :  /a  :  ) )  n  MapiOwnpik)  n  {fA  :  I  a  ■  .-%))- 

But  since  MapifA  ■  l a  ’■  *a)  =  (J B  ■  h  ■  *b)<  we  can  remove  the  (/a  :  Ib  ■  -“a)  term.  Note  that  we 
cannot  remove  the  ( /.4  :  l a  ■  *a  )  term,  since  the  Map  function  is  only  defined  on  elements  in  the  set 
(fA  :  l  a  :  S4).  Thus  the  intersection  to  compute  is  the  following: 

=  Own-si B)D  Map(Ownp(A)n  (fA  :  l a  •  sa))-  (6) 

In  the  following  derivation,  we  ignore  the  upper  bounds  of  the  slices,  because  in  the  definition  of  the 
Ownp(k)  set  in  equation  (3),  the  upper  bound  of  each  slice  is  independent  of  the  value  of  /.  The  same 
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holds  true  for  Owns(B)  in  equation  (4).  This  independence  allows  us  to  delay  the  derivation  of  the  upper 
bound  of  the  intersection  until  the  end. 

Our  basic  algorithm  for  computing  'S'  takes  the  following  form: 

#  =  {} 

DO  f  =  0  UPTO  np  -  1 
DO  i'  =  0  UPTO  ns  -  1 

¥  =  #  U  ((fs  +  i' :  Is  ■  ss)  n  Map((fv  +  j' :  Ip  :  sp)  n  (fA  :  lA  :  sA))) 

END  DO 
END  IX) 

Note  that  some  of  the  intersections  will  be  empty,  regardless  of  the  upper  bounds  of  the  slices.  Our  algorithm 
takes  this  possibility  into  account,  and  only  iterates  over  values  of  i'  and  j1  for  which  the  intersection  is 
nonempty  (assuming  large  upper  bounds).  Restricting  the  iteration  set  is  accomplished  by  transforming  the 
i'  and  j'  indices  into  i  and  j ,  with  index  bounds  chosen  to  avoid  empty  intersections. 

To  compute  <?,  we  first  compute  the  following  intermediate  set,  which  is  a  portion  of  equation  (6): 

n-p  —  1 

Ownp(k)n(fA::  sA)  =  (J  {(fp+j'::sp)C\(fA  ::s4)}. 

j'= o 

This  intermediate  set  is  computed  using  the  methods  described  in  Section  5.  We  use  Euclid’s  algorithm 
to  find  integers  a?i,  y\,  and  g\  such  that  +  y\sp  =  g\  =  gcd(s.4,.st>).  We  define  a  function  72/  which 
extracts  a  set  of  representative  from  an  input  set,  using  the  concept  of  a  representative  of  a  slice  presented 
in  Section  5.  The  subscript  I  is  a  list  of  indices.  For  each  valid  instantiation  of  /,  there  is  one  associated 
representative.  For  example,  TZi ,},  I  <  t  <  10,  1  <  j  <  10,  produces  a  set  of  100  representative  elements. 
Representatives  of  the  intermediate  set,  according  to  equation  (2),  are: 

/  *  \  /  r  \  \  r  ,  Uv  +  j'  -  Ia)X\SA 

TZj'(Ownp(k )  n  (fA  : :  sA ))  =  fA  + - 

for  the  values  of  0  <  j’  <  np  -  1  where  g\ \(fp  +  j'  -  fA ).  Where  g\  does  not  divide  fp  +  j'  -  f  A,  the 
corresponding  slice  intersection  is  empty  and  there  is  no  representative.  The  resulting  stride  is  for  all 

/• 

We  can  introduce  a  new  integer  variable  j  where  g\j  =  fp  +  j'  -  f,\,  giving  us  j'  =  g\j  +  fA  -  fp. 
Since  0  <  j'  <  np  -  1 ,  we  have: 

' fy  -  /a'  <  •  <  I  /p  -  /a  +  np  -  1 
g\  ~  L  0\ 

This  change  of  variables  allows  us  to  transform  our  intermediate  set  representatives  to  the  following: 


TZj(Ownp(k)  n  (/. 4  : :  5.4))  =  fA  +jx\sA 


fv  ~  Ia  ^  ^  fp  -  /a  +  np  -  \ 

-  <  j  <  - - - 

9 1  L  9\ 


(7) 

(8) 


Note  that  for  some  possible  values  of  the  parameters,  the  lower  bound  of  j  may  in  fact  exceed  the  upper 
bound,  in  which  case  there  are  no  legitimate  values  of  j  and  hence  no  representatives.  We  apply  the  Map 
function  to  equation  (7)  to  get  us  one  step  closer  to  and  find  that  a  representative  is: 


II 


(9) 


Kj(Map(0wnv( A)  n  (fA  : :  a*)))  =  fB  +  jx\sB 
\fv  ~  Sa  1  .  .  .  I  h  ~  /a  +  n-p  -  1  I 


>—41  <  j< \i 

g\  L 


and  the  new  stride  is  for  all  j. 

Si  j 

Now  we  intersect  equation  (9)  with  <3ums(B)  to  yield  a  representative  of  equation  (6): 

#>,«'(*)  =  (JjB  +  jx\SB  ■ :  n  (fs  +  i' ; : 

rfejjui  . < | h sJa 1  -0<i-<  , 

I  5i  I  L  5i  J 

In  the  same  manner  as  before,  we  run  Euclid’s  algorithm  on  and  s$  to  yield  integers  ii,  y2, 
and  52  such  that  12  •  +  Vi&s  =  52  =  gcd(i^E-,ss).  Since  the  intersection  is  nonempty  only  when 

52K/5  +  i'  -  Ib  -  jxisB),  we  define  as  before  an  integer  i  such  that  g2i  =  fs  +  i'  -  fB  -  jx\sB ,  or 
equivalently,  i'  =  521  -  /$  +  /b  +  jx\sB.  Since  0  <  i'  <  ns  —  1,  we  find  that: 


"fs  ~  Ib  -  jx\»B 


<  i  < 


fs  -  Zb  -  jx\sB  +  ns  -  1 


Once  again,  note  that  the  lower  bound  of  i  could  exceed  the  upper  bound,  in  which  case  there  are  no 
representatives  for  that  particular  value  of  j.  Thus  representatives  of  the  final  result  $  are  given  by: 


ftj.iW  =  /s  +  jXlSB  + 


lX2SBSx> 


<j<  \’x -4± -p - 1  (ii) 

5i  I  L  5i 

’fs  ~  fB  -  jx\sBl  ^  ^  I  fs  ~  fB  ~  jx{SB  +  ns  -  1  p 

52  I  “  L  52  J 

The  resulting  stride  is 

Now  that  we  have  the  representatives,  we  can  find  the  corresponding  lower  bounds  of  the  slices.  Let  c 
be  the  tightest  lower  bound,  given  the  original  lower  bounds;  it  is  defined  as: 


c  =  max 


{Map(fA),Map(fv)Js}  =  ma x{/s.  •  sb  +  / b-Is }• 


Then  as  shown  in  Section  5,  the  true  lower  bound  of  the  slice  is  given  as: 

.  /.t.x  >  .SBSV*s\  ,  ,  ■  .  >X2*B*V  .  ,  *B»P*S 

c  +  (ftj.iW  -  c)  mod -  =  c  +  (/e  +  jx\sB  + - c)  mod - 

V  5i52  /  V  5i  5i52 

Finally,  we  need  to  find  the  upper  bound  of  the  slices: 

min{ Map( l a),  Map(b)Js}  =  min  Qm‘n^  — —  •  xg  +■  fB.ls^j  ■ 


12 


6.2  Algorithm 

Given  the  assignment  statement  from  equation  (1),  where  processor  S  owns  elements  of  B  characterized  by 
fs.  Is,  s s,  and  ns  according  to  equation  (4),  and  processor  D  owns  elements  of  A  characterized  by  fp,  Ip, 
sp,  and  np  according  to  equation  (3),  we  compute  the  set  of  elements  of  B  that  processor  S  should  send  to 
processor  V.  Figure  7  gives  an  algorithm,  Global-Index-Send,  to  determine  this  set.  It  is  derived  from 
equations  (10),  (1 1),  (12),  (13),  (14),  and  (15). 


(xi  ,y\,g\)  =  euclid(sA, sp) 

(X2,V2,92)  =  eucl  id(sBsp/gi,ss) 
stride  =  sBspss/(gi92) 

last  =  min(/s  +  sB[(min(lA,lp)  -  Ia)/sa\,Is)  (15) 

c  =  max(/B  +  sB\(max(fA,fp)  -  fA)/sA],fs)  (13) 

DO  j  =  \(fp  -  fA)/gi)  UPTO  [(fp  -fA  +  nv-\  )/<?, J  (11) 

DO  i  =  \(fs  ~  fB  ~  jx\sB)lg{\  UPTO  [(fs  -  fB  -  jx{sB  +  ns  -  1  )/go\  (12) 

r  =  fB  +  jx\SB  +  ix2SBsp/g\  (10) 

f  irst  =  c  +  ((r  -  c)  mod  stride)  (14) 

output-slice(f  irst,  last,  stride) 


Figure  7:  Global-Index-Send  algorithm.  The  numbers  in  the  right-most  column  refer  to  the  equation  from 
which  the  corresponding  line  was  derived. 


In  practice,  when  computing  the  value  of  the  representative  r,  we  can  sometimes  get  a  value  that  is  not 
expressible  in  32  bits.  This  potential  overflow  is  due  to  the  term  ixzsBsp/g i,  and  the  fact  that  i  and  xi  can 
have  quite  large  magnitudes.  In  this  case,  we  can  make  use  of  the  fact  that  we  can  add  or  subtract  multiples 
of  the  stride  and  still  have  a  valid  representative.  For  example,  we  can  replace  the  term  ix2SB.sp  jg\  with 
2^z([r2(i  mod  !*)]  mod  ^). 

3l  u  91  u  91  ’ 


6 3  Local  memory 

While  the  analysis  above  determines  which  global  indices  of  the  array  elements  should  be  sent,  we  still  need 
to  map  these  global  indices  to  the  processor’s  local  memory.  Here  we  give  a  mapping  function. 

One  possible  mapping  is  simply  the  identity  mapping,  where  the  global  indices  and  the  local  indices 
are  the  same.  However,  the  identity  mapping  is  quite  wasteful  of  memory,  and  one  goal  of  parallelizing 
applications  is  to  allow  problems  with  larger  memory  requirements  to  be  solved. 

To  minimize  the  amount  of  wasted  memory  space,  we  will  want  to  compact  the  array  (i.e.,  place  the 
blocks  contiguously  in  memory).  Hiranandani  et  al.  [10]  argue  that  this  compaction  is  actually  necessary, 
and  should  not  be  viewed  merely  as  an  optimization.  Given  a  global  index  x  and  a  block-cyclic  distribution 
characterized  as  above  in  terms  of  /,,  ra„  and  -s,,  we  find  that  the  local  index  LM,(x)  is  defined  as: 


block 


LMi(x)  =  - - — 

L  ^ 


offset 


•Hi  +  ((x  -  /,)  mod  5j )  +K 


(16) 


where  K  is  the  index  of  the  first  array  element.  In  theC  programming  language,  K  is  0;  in  Fortran  K  is  usually 
1 .  This  mapping  is  illustrated  in  Figure  8. 

Note  that  equation  (16)  is  only  defined  on  elements  owned  by  the  processor,  Ul'=o '  ( ft  +  A-  :  /,  :  ). 

However,  we  will  wish  to  apply  the  function  to  last  (one  of  the  parameters  defined  in  Figure  7),  which 


13 


LM_A 


Figure  8:  An  example  of  the  local  memory  mapping  function  for  a  particular  processor.  The  processor  owns 
array  elements  of  A  with  global  indices  beginning  with  /  =  5,  with  block  size  n  =  4  and  stride  s  =  12.  These 
global  indices  are  mapped  to  contiguous  indices  in  local  memory. 

is  not  necessarily  an  element  owned  by  the  processor.  In  this  case,  we  must  simply  ensure  that  the  offset 
portion  of  the  equation  is  less  than  n,,  by  the  use  of  the  min  function. 

It  is  easy  to  see  that  £Af,-(x  +  s* )  =  LMfx  )  +  n,.  This  property  is  useful  when  doing  the  local  memory 
mapping  for  a  slice,  where  the  stride  component  of  the  slice  is  a  multiple  of  st.  In  this  case,  we  need  only 
evaluate  the  function  for  the  first  and  the  last  component  of  the  slice,  and  not  for  every  element  of  the  slice. 

Another  useful  property  of  the  LM,  function  can  be  used  in  doing  the  local  memory  mapping  for  the 
GLOBAL-lNDEX-SEND  algorithm.  We  need  to  compute  L\l( first),  which  can  be  done  more  efficiently 
using  the  property  that  the  offset  portion  of  LM( first)  is  (first  -  fs)  mod  =  if  where  i'  = 
-  fs  +  Ib  +  j x  1 SB • 

These  observations  allow  us  to  formulate  an  algorithm  for  computing  the  local  indices  of  array  elements 
to  send,  Local-Index-Send,  shown  in  Figure  9. 


(xi,yi,gi)  =  auclid(s.4,sp) 

(*2iJ/2,02)  =  euclid(ss^v/ ffi,  ■ss ) 

stride  =  sBsvssl{9\92 ) 

lmstride  =  SBSpnsHgtgi) 

last  =  min(/s  +  sB[(m\n(lAJv)  -  Ia)/sa\Js) 

lmlast  =  [(last  -  fs)/ss\  nS  +  min((last  -  fs )  mod  $5.  ns  -  1 )  +  K 

c  =  max(/B  +  sB f(max(/4,/p)  -  /a)/«a1./s) 

DO  j  =\(fv~  f a)/9\ 1  UPTO  [(fv  -  f.A  +  nv  -  1  )/gx\ 

DO  i  =  [(fs  -  Sb  ~  jx\$B)/92]  UPTO  [(fs  ~  fB  ~  jx\*B  +  »S  ~  1  )/<7:J 
i'  =  921  -  fs  +  fB  +jX]SB 

r  =  fB  +  jx\sB  +  ix2SB*v/g\ 

first  =  c  +  ((r  —  c)  mod  stride) 

lmf irst  =  (first  -  fs  -  i')ns/*s  +  i'  +  K 

output-local-slice(  lmf  irst,  lmlast,  lmstride) 


Figure  9:  Lcx:al-Index-Send  algorithm,  derived  from  applying  the  LM  function  to  the  Global-Index- 
Send  algorithm. 
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7  Receives 


This  section  describes  how  a  processor  determines  what  data  to  receive  from  another  processor.  This 
calculation  essentially  involves  determining  the  order  in  which  the  sending  processor  inserted  array  elements 
into  the  buffer.  As  in  Section  6,  the  data  is  described  in  terms  of  which  elements  of  the  right-hand  side  array 
are  needed  by  the  receiving  processor  and  can  be  supplied  by  the  sending  processor.  We  also  describe  how 
the  receiving  processor  maps  the  global  array  indices  to  local  memory. 

7.1  Algorithm 

As  shown  in  Figure  1,  the  first  step  in  the  evaluation  of  the  array  assignment  statement  is  to  copy  B(  fg  : 
lg  :  sg)  into  T(/a  '■  'a  -  $a)-  This  copy  step  can  involve  interprocessor  communication.  We  choose  to 
copy  into  this  particular  section  of  T  so  that  during  the  second  evaluation  step,  in  which  we  apply  T  to  T  and 
assign  the  results  to  A,  we  can  use  the  same  array  index  to  access  both  A  and  T.  (The  details  of  this  second 
step  are  described  in  Section  8.) 

To  process  a  received  message,  we  must  copy  each  element  from  the  receive  buffer  into  some  location 
in  T.  The  mapping  of  elements  from  the  receive  buffer  into  T  is  a  complex  function,  and  is  the  subject 
of  this  section.  In  general,  due  to  the  complexity  of  the  formulas  in  the  send  algorithm,  the  receiver  must 
re-run  the  send  algorithm  to  determine  which  elements  in  the  receive  buffer  correspond  to  which  elements 
of  B.  Then  we  apply  the  inverse  of  the  Map  function  defined  in  equation  (5)  to  determine  where  each  buffer 
element  should  be  placed  within  T.  Re-running  the  send  algorithm  and  applying  the  inverse  Map  function 
can  easily  be  integrated  into  a  single  operation. 

To  run  the  algorithm,  we  of  course  need  to  know  all  the  original  parameters.  Since  each  processor  runs 
the  same  code  in  the  SPMD  model,  we  already  know  the  values  of  /a,  La,  s.-i.  /b,  /b,  and  .sg.  We  also 
know  which  portion  of  the  left-hand  side  array  we  own,  Oump(A),  characterized  by  the  parameters  fp, 
h,  sp,  and  n-p.  The  only  parameters  we  don’t  necessarily  know  are  fs,  Is,  and  ns -  However,  in  a 
block-cyclic  distribution,  all  processors  involved  in  the  distribution  will  have  the  same  values  of  Is ,  sj>, 
and  ns-  Thus  the  only  parameter  that  has  to  be  passed  along  with  the  message  is  the  sending  processor's 
value  of  fs-  Alternatively,  if  the  receiving  processor  can  determine  the  identity  of  the  sending  processor, 
the  receiving  processor  can  calculate  or  look  up  the  value  of  fs  itself. 

Our  algorithm  to  receive  a  slice,  then,  is  identical  to  the  send  algorithm,  except  that  the  inverse  of  the 
Map  function  is  applied  to  the  assignments  to  last,  c,  and  r.  In  addition,  the  .s g  factor  in  the  stride  changes 
to  $.4.  The  algorithm,  Global-Index-Receive,  is  shown  in  Figure  10. 


=  euclid(5.4,sp) 

{X2,yi,9l)  =  ®UClid(^g5p/<7|.,S5  ) 

stride  =  sasv»sI(9\9z) 

last  =  min(/.4,/x>,/.4  +  »a[Us  ~  /b)/-**bJ) 

c  =  max(  fA  +  sA\( max(  fg,fs)~  Ib )/sb]  ,  fv ) 

DO  j  =  r (fv  -  f.A)/g\]  UPTO  [(fV  -  fA  +nv-  1  )/«7,J 

DO  i  =  f (fs  ~  fB  ~  jx\*g)/92]  UPTO  [(fs  -  fg  -  jx\Sg  +  n$  -  1  )/</; J 

r  =  fA  +  jx\s.A  +  ix2sAsv/g  i 
first  =  c  +  ((r  -  c)  mod  stride) 
input-slice(f  irst,  last. stride) 


Figure  10:  Global-Index-Receive  algorithm. 
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7.2  Local  memory 

As  in  the  case  of  sends,  we  must  convert  the  global  indices  formed  by  the  Global-Index-Receive  algorithm 
into  local  memory  indices.  The  mapping  function  is  the  same  as  before,  except  that  we  will  use  parameters 
fp,  b>  and  np  rather  than  fs,  Is ,  #s,  and  ns-  This  substitution  is  necessary  because  the  former 
parameters  describe  the  layout  of  the  left-hand  side  array,  with  which  the  data  is  to  be  aligned,  while 
the  latter  parameters  describe  the  layout  of  the  right-hand  side  array,  which  is  being  sent.  Using  the  same 
analysis  as  for  the  LOCAL-iNDEX-SEND  algorithm,  we  derive  the  algorithm,  Local-Index-Receive,  which 
determines  the  local  memory  indices  of  the  data  being  received,  as  shown  in  Figure  1 1 . 

=  eudid(s.4,st>) 

{xi,yi,9l)  =  euclid  (sBsp/g\,ss) 

stride  =  sAspssl(g\g2 ) 

lmstride  =  SAnpssligxgi) 

last  =  min (IaJvJa  +  -  /b)/sbJ) 

lmlast  =  [(last  -  fp)/sp\np  +  min((last  -  fp)  mod  sp.np  -  1)  +  K 
c  =  max{fA,fv,fA  +»a\Us  ~  /b)/*b1) 

DO  j  =  ((fp  -  f.A )/ flil  UPTO  [(fp  -fA+np-  l)/g,J 
t  =  (/a  +  jxiSA  -  fp )  tnod  Sp 

DO  i  =  [(fs  ~  Ib  ~  jx\SB)lgi\  UPTO  [(fs  -  fB  -  jx\SB  +  ns  -  1  )/tf2j 
r  =  /a  +  jx\sA  +  ix2sAsp/g  i 
first  =  c  +  ((r  -  c)  mod  stride) 
lmf irst  =  (first  -  fp  -  t)np/sp  +  t  +  K 
input-local-slice(  lmf  irst,  lmlast.  lmstride) 

Figure  1 1 :  Local-Index-Receive  algorithm. 


8  Computation 

After  the  communication  step  completes,  all  data  to  be  used  for  the  computation  will  be  local  to  each 
processor.  In  addition,  the  Local-Index-Receive  algorithm  works  in  such  a  way  that  the  portion  of  the 
right-hand  side  array  received  will  be  aligned  with  the  left-hand  side  array.  When  the  received  data  is  placed 
in  this  manner,  we  can  use  the  same  loop  index  to  access  both  the  left-hand  side  array  and  the  temporarily 
redistributed  right-hand  side  array. 

Assume  once  again  that  we  are  executing  the  statement  in  equation  (1).  When  we  perform  the  Local- 
Index-Receive  algorithm,  we  store  the  data  in  a  temporary  array  T  which  has  the  same  size  and  shape  as 
A.  Then  we  compute  by  looping  through  the  arrays  and  assigning  the  results. 

To  loop  through  the  arrays,  we  first  need  to  determine  the  global  indices  of  the  array  elements  on  which 
to  perform  the  computation.  This  set  is  given  by. 

Ownp(k)  fl(/.4  :  l a  '■  $a)- 

From  equations  (7)  and  (8),  we  have  that  a  representative  of  this  set  is  given  by  r  =  /,*  -f  jr\s A  for 
[h-t*]  <  j  <  [b-Lifnx',-1^  When  we  set  c  -  ma x(/,.j,/p)  to  be  a  lower  bound,  we  find  as  before 
that  the  true  lower  bound  is  given  by  c  +  ((r  -  c.)  mod  sa^v/g\  )■ 
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However,  we  still  need  to  map  this  global  index  to  local  memory.  We  do  this  by  applying  the  previously 
defined  LM  function  from  equation  (16)  to  the  upper  bound  and  the  true  lower  bound,  and  converting  the 
global  stride  of  SASp/gi  to  SAnp/g \. 

These  observations  yield  the  algorithm  Compute-Loop,  shown  in  Figure  12. 

(xi,yi,gi)  =  euclid^^sp) 
stride  =  SASp/g  t 
lmstride  =  SAnp/g  \ 
last  =  min(/,4,/z>) 

lmlast  =  [(last  -  fp)/sp\np  +  min((last  -  fp)  mod  sp,np  -  1 )  +  K 
c  =  mzx(fA,fp) 

DO  j  =  C (fp  -  fA)/gi]  UPTO  [{fp  -fA  +  np-  1 )/ <7i J 
f  =  9\i  +  /a  -  fv 
r  =  /a  +jxisA 

first  =  c  +  ((r  -  c)  mod  stride) 
lmf irst  =  (first  -  fp  -  j')np/sp  +  j'  +  K 
DO  i  =  lmfirst  UPTO  lmlast  BY  lmstride 
_ A(i)  =  /-(T(i)) _ 

Figure  12:  Compute-Loop  algorithm,  which  determines  which  elements  of  the  array  to  compute  once  the 
communication  has  completed. 


9  Overall  algorithm 

From  the  Local-Index-Send,  Local-Index-Receive,  and  Compute-Loop  algorithms,  we  now  have  the 
means  to  construct  a  general  algorithm  for  executing  a  single  array  assignment  statement.  This  algorithm, 
which  executes  on  all  processors  in  the  SPMD  style,  takes  the  following  form: 

/*  Send  phase  */ 

S  =  getmyidQ 
fs  =  getf  irst(B,<5) 

DO  V  =  0  UPTO  N  -  1 
fp  =  getf  izst{k,V) 

CALL  LOCAL-INDEX-SEND 
END  DO 

/*  Receive  phase  */ 

V  =  getmyid( ) 
fp  —  getf  irst(A,"P) 

DO  S  =  0  UPTO  N  -  1 
fs  =  getf  irst(B,«S) 
call  Local-Index-Receive 
END  DO 

/*  Computation  phase  */ 

call  Compute-Loop 

In  this  pseudo-code  we  assume  that  the  system  contains  N  processors,  and  each  processor  has  a  unique 
ID,  returned  by  the  function  getmyid,  between  0  and  N  —  1,  inclusive.  The  function  getf  irst  returns 
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the  first  index  of  the  array  (the  first  function  argument)  that  is  owned  by  the  processor  (the  second  function 
argument),  as  given  in  equations  (3)  and  (4).  Note  that  this  overall  algorithm  implies  that  a  processor  may 
send  a  message  to  itself  and  correspondingly  receive  from  itself. 

An  illustration  of  the  communication  algorithm  is  given  in  Figure  1 3.  In  the  first  conceptual  step,  a 
sending  processor  computes  an  intersection  of  global  array  indices.  In  the  second  step,  the  processor  maps 
these  global  indices  to  local  indices.  Next,  the  processor  copies  the  required  array  elements  into  a  send 
buffer;  note  the  irregular  order  in  which  the  elements  are  copied.  Finally,  the  sending  processor  sends  the 
buffer  to  the  destination  processor.  The  destination  processor  receives  the  data  into  a  receive  buffer,  and 
proceeds  to  copy  the  data  from  the  receive  buffer  into  a  temporary  array  corresponding  to  the  left-hand 
side  array.  Now  that  the  destination  processor  has  received  messages  from  both  sending  processors,  it  can 
continue  with  its  computation. 

Each  of  the  jV  send  phases  and  jV  receive  phases  are  data  independent  and  can  be  executed  in  any 
order  if  we  have  a  buffered  message  passing  system.  However,  there  is  a  dependence  between  the  send 
on  one  processor  and  the  corresponding  receive  on  the  destination  processor.  Thus  even  with  a  buffered 
message  passing  system,  we  must  schedule  the  send  and  receive  operations  to  avoid  deadlock.  For  example, 
scheduling  deadlock  could  occur  if  the  first  operation  on  every  processor  is  a  receive,  since  every  processor 
would  be  waiting  on  a  receive  but  no  processor  would  be  sending. 

Suppose  the  message  passing  system  is  not  buffered,  and  the  program  is  executing  on  a  MIMD  archi¬ 
tecture.  It  is  possible  for  a  “fast”  processor  to  complete  its  communication  during  phase  n ,  proceed  with 
its  computation,  and  begin  communication  phase  n  4-  1,  while  a  “slow”  processor  is  still  trying  to  complete 
communication  phase  n.  If  the  fast  processor  now  sends  a  message  to  the  slow  processor,  the  slow  processor 
has  the  problem  of  distinguishing  phase  n  messages  from  phase  n  +  1  messages.  This  potential  problem 
can  be  alleviated  by  inserting  a  barrier  synchronization  before  each  communication  phase.  No  barrier  is 
necessary  on  a  SIMD  architecture,  since  the  synchronization  is  implicit. 

10  Relaxing  restrictions 

In  this  section,  we  relax  the  simplifying  restrictions  imposed  in  Section  4,  regarding  the  format  and  type  of 
array  statements  allowed. 

10.1  Multiple  right-hand  side  terms 

The  right-hand  side  of  the  assignment  statement  could  have  multiple  terms.  Each  term  could  take  one  of 
the  following  forms: 

•  An  array  slice,  such  as  B(  1  :  n) 

•  A  single  array  element,  such  as  B{  i) 

•  A  scalar  value,  such  as  3  or  x 

In  the  first  case,  the  relevant  portions  of  an  array  slice  will  be  sent  to  each  processor  and  stored  in  a 
temporary  array,  aligned  with  the  left-hand  side  array.  In  the  second  case,  the  single  array  element  will 
be  broadcast  to  all  processors  involved  in  the  computation,  and  stored  in  a  temporary  scalar  variable.  In 
the  third  case,  the  scalar  value  will  be  used  in  place  without  any  communication,  since  its  value  will  be 
consistent  across  all  processors. 

If  there  are  several  terms  on  the  right-hand  side,  then  we  simply  perform  a  communication  step  if 
necessary  for  each  term.  The  receiving  processor  must  have  one  temporary  array  or  scalar  for  each  term 
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on  the  right-hand  side.  The  computation  loop  is  performed  as  before,  except  that  the  right-hand  side  of  the 
assignment  statement  in  the  inner  loop  has  a  reference  to  each  temporary  variable. 

An  even  better  method  for  handling  multiple  right-hand  side  terms  would  be  to  group  the  messages 
together  and  send  only  one  message  to  each  processor,  since  this  grouping  would  eliminate  the  overhead  of 
the  additional  messages.  The  send  buffer  would  then  contain  data  for  the  entire  right-hand  side,  rather  than 
for  just  one  right-hand  side  term.  If  it  was  determined  that  there  were  no  data  dependences  between  several 
consecutive  statements,  a  still  better  method  would  be  to  block  all  data  together  involving  right-hand  side 
terms  for  those  statements. 

10.2  Multidimensional  arrays 

For  the  most  part,  the  analysis  performed  above  extends  orthogonally  for  multidimensional  arrays.  For 
example,  if  we  had  the  statement: 

A(slicel,slice2)  =  ^(BfsliceB,  slice4)), 

we  would  determine  the  indices  of  the  array  elements  in  the  first  dimension  of  B  owned  by  processor  5  that 
needed  to  be  sent  to  processor  V,  and  then  do  the  same  for  the  second  dimension.  The  Cartesian  product  of 
these  two  sets  would  correspond  to  the  array  elements  to  be  sent. 

For  example,  suppose  {i|, . . . ,  im}  is  the  set  created  by  the  intersections  for  the  first  dimension,  and 
{ji,.  ■■  ,jn}  is  the  set  created  by  the  intersections  for  the  second  dimension.  Then  the  elements  to  be  sent 

are  B(  ,  j{ ),  B(  i, ,  j2 ),...,  B(  i | ,  jn ),  B(  i2,j i ),  B(  i2,ji ) . B(  im.jn ). 

But  problems  arise  when  an  array  reference  contains  both  scalar  and  slice  indices,  such  as  in  the 
statement: 

A(t,slical)  =  jF(B(slice2.j)). 

The  main  problem  here  is  that  the  analysis  presented  so  far  assumes  that  all  array  indices  are  slices,  not 
scalars.  A  secondary  problem  is  that  the  compiler  must  be  sure  to  match  up  slicel  with  slice2.  ;o 
preserve  the  semantics  of  the  array  assignment  statement. 

To  compile  multidimensional  array  statements,  we  must  do  the  following.  We  modify  the  definition 
of  the  Own  function  such  that  Ownp{C,k)  represents  the  set  of  indices  of  array  elements  that  processor  /> 
owns  in  dimension  k  of  C.  We  observe  that  in  the  above  statement,  processor  5  performs  a  send  only  if 
j  €  Owns( B,2).  Likewise,  processor  V  receives  data  only  if  i  €  Ownp(k.  1 ).  Therefore,  each  processor 
can  participate  in  a  send  or  receive  only  if  it  owns  the  array  element  corresponding  to  the  index  in  each 
scalar  dimension  of  the  array. 

After  the  scalar  dimensions  are  checked,  we  can  match  up  the  slice  dimensions  one  for  one  and  run  the 
algorithm  above.  As  an  example,  consider  the  assignment  statement: 

A(ij, slicel.  slice2,ji )  =  JF(B(  i2.  ji- slice3.  slice4)). 

Figure  14  shows  the  pseudo-code  that  processor  6'  executes  to  determine  what  portion  of  B  to  send  to 
processor  V.  Note  that  in  the  pseudo-code  given,  we  have  neglected  to  perform  the  mapping  to  local 
memory,  but  the  true  compiled  code  must  do  the  mapping. 

Processor  V  executes  a  similar  loop  nest  to  determine  what  array  elements  are  sent  by  processor  J>.  In 
addition,  processor  V  executes  a  similar  loop  for  the  computation  phase. 
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/*  Determine  send  for  A(i|,slicel,slica2,j|)  =  ^(8(12, J2,slice3,slice4))  */ 
lf(i’i  €  0ump(A,  1)  and  j\  €  0wn?>(k,4))  then 
If  (*2  €  Owns(B,  1)  and  j'2  €  0t<m$(B,2))  then 

For  each  <1  €  0um5(B.3)  n  Map(0wnz>{A,2)  D  slicel)  do 
For  each  <2  €  Otun,s{B,4)  n  Map(Ownp(k.2>)  D  slice2)  do 
Add  array  element  B( hj2Ji,  h )  to  the  list  of  elements  to  send 


Figure  14:  Pseudo-code  to  determine  the  send  for  a  multidimensional  array  assignment. 

103  Negative  strides 

In  an  assignment  statement,  it  is  possible  for  either  or  both  of  the  stride  components  of  the  slices  to  be 
negative.  Negative  strides  present  problems  for  a  number  of  reasons. 

Consider  a  slice  (/  :  / :  s),  and  suppose  that  sis  negative.  While  /  is  still  a  representative  in  the  modified 
slice  representation,  it  is  no  longer  the  case  that  /  is  a  lower  bound  and  l  is  an  upper  bound.  Instead,  the 
roles  of  /  and  l  are  reversed  when  s  is  negative.  In  this  case,  (/  :  Z  :  s)  is  equivalent  to  (/  :  /  :  -  s  :  /)  in 
the  modified  slice  representation  presented  in  Section  5. 

Another  problem  with  negative  strides  lies  with  Euclid’s  GCD  algorithm,  which  we  use  to  intersect 
slices.  This  algorithm  requires  bo  h  input  strides  to  be  nonnegative. 

Perhaps  the  most  difficult  problem  to  deal  with,  however,  is  with  respect  to  alignment.  Recall  that  one 
of  the  goals  of  the  send  and  receive  loops  is  for  the  right-hand  side  array  to  be  temporarily  realigned  with 
the  left-hand  siue  array,  so  that  the  same  loop  index  can  be  used  to  access  both  arrays.  When  the  signs  of 
the  two  strides  differ,  the  right-hand  side  area)  must  be  somehow  “reversed”  during  either  the  send  or  the 
receive  phase.  This  reversal  can  be  accomplished  by  changing  the  sign  of  the  stride,  and  reversing  the  roles 
of  the  upper  and  lower  bounds. 

We  address  the  reversal  problem  by  reversing  the  order  within  the  send  algorithm  when  .sj 9  is  negative, 
and  reversing  the  order  within  the  receive  algorithm  when  5.4  is  negative.  However,  there  are  more  changes 
in  the  basic  algorithm  which  must  be  made.  In  the  send  algorithm,  recall  from  equation  ( 1 3)  that  the  lower 
bound  c  is  defined  as  max{A/ap(/.t ),  Map(fp),fs}.  A  more  general  form  for  the  lower  bound  is: 

max{min( M ap( /a),  M ap(lA ) ). min(  M np(  /p ).  A Iap( l-p ) ).  min(  fsJs)}- 

A  more  general  form  for  the  upper  bound  defined  in  equation  ( 1 5)  is: 

min{max(  Map{  f a  ),  M  ap(lA ) ).  max(  Map(  /p ).  Map(  /p ) ).  max( 

The  lower  bound  for  the  receive  algorithm  can  be  defined  as: 

max{  min(  fA ,  l  a  )  -  min(  /p ,  /p ).  min(  Map' 1  ( fs )  •  Map~ 1  ( Is ) ) } . 

and  the  upper  bound  for  the  receive  algorithm  can  be  defined  as: 

min{  max(  fA  Ja),  fv .  h  )>  max(  Map~ 1  ( fs ).  M  ap~ 1  (Is))}  ■ 

Note  that  when  the  resulting  stride  is  negative,  c  and  last  are  swapped:  we  use  the  upper  bound  for  r  and 
the  lower  bound  for  last. 

In  the  definitions  of  the  algorithms  Global-Index-Send,  Local-Index-Send,  Glob  \l-Index- 
Receive,  Local-Index-Receive,  and  Compute-Loop,  we  have  assumed  that  both  strides  sA  and  sB 
are  positive.  For  the  send  and  receive  algorithms,  we  must  expand  the  algorithm  to  consider  four  cases, 
depending  on  the  signs  of  s. 4  and  sg.  The  COMPUTE-LOOP  algorithm  must  now  have  two  cases,  depending 
on  the  sign  of  .s^.  These  revised  algorithms  are  given  in  Appendix  A. 
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10.4  Partially  replicated  arrays 

In  a  relaxation  or  convolution  operation,  each  element  of  an  array  is  set  to  be  some  function  of  itself  and  its 
neighboring  elements.  A  natural  way  to  parallelize  this  operation  is  have  each  processor  compute  the  set  of 
array  elements  corresponding  to  the  section  of  the  array  it  owns.  However,  there  must  be  communication  at 
the  block  boundaries,  since  at  the  block  boundary  some  of  the  neighbors  of  the  boundary  element  will  reside 
on  a  different  processor.  One  possibility  for  dealing  with  this  communication  is  to  temporarily  redistribute  the 
array  into  a  form  where  copies  of  boundary  elements  reside  on  multiple  processors.  After  the  redistribution 
takes  place,  every  processor  can  compute  its  results  locally  without  any  additional  communication. 

When  redistributing  the  original  array  A  into  a  temporary  array  A',  we  give  A'  a  distribution  in  which 
ownership  is  overlapped  at  block  boundaries.  The  size  of  the  overlap  depends  upon  the  distance  of  each 
element  from  the  neighbors  it  needs  to  access.  Each  array  element  c-u  potentially  have  multiple  owners. 

Allowing  such  partially  replicated  arrays  will  complicate  the  algorithms  presented  above  only  slightly. 
In  the  general  case,  no  changes  need  be  made  to  the  algorithms  given.  However,  in  Section  10.1,  we  state 
that  if  the  right-hand  side  term  is  a  single  array  reference,  such  as  B(  i ),  then  the  owner  broadcasts  the  value. 
If  this  broadcasting  is  to  be  done  in  the  case  of  partially  replicated  arrays,  then  each  processor  must  be 
prepared  to  receive  multiple  broadcasts  (depending  on  how  many  processors  own  that  array  element),  or 
else  it  must  be  the  case  that  no  two  owners  send  the  array  element  to  the  same  processor. 

10.5  Nested  array  references 

A  nested  array  reference  takes  the  form  A(Q),  where  A  is  a  distributed  array  and  Q  is  an  array  slice  or  an 
array  element.  In  general,  we  know  nothing  at  compile  time  concerning  the  evaluation  of  Q.  Therefore,  we 
have  no  idea  which  processor  or  processors  own  A(q).  Thus  we  must  broadcast  Q  to  all  processors,  and  then 
proceed  with  the  owner-computes  rule.  This  compilation  method  could,  of  course,  be  quite  expensive,  both 
in  terms  of  execution  time  and  memory  usage. 

’Ye  achieve  efficiency  in  ordinary  array  statements  by  depending  on  the  fact  that  the  sequence  of  array 
indices  can  be  specified  as  a  slice  expression.  This  assumption  breaks  down  in  the  case  of  nested  array 
references.  An  efficient  solution  to  nested  array  references  is  beyond  the  scope  of  this  paper. 

11  Generating  array  ownership  descriptors 

In  Section  9,  the  getf  irst  function  was  referenced.  This  function  returns  the  first  index  of  a  particular 
array  that  is  owned  by  a  particular  processor.  (Actually,  tne  function  should  also  take  a  third  argument,  the 
array  dimension  of  interest.)  This  section  describes  how  this  function  can  be  implemented,  in  terms  of  the 
user-input  alignment  and  distribution  directives.  In  general,  this  function  will  be  expensive  to  compute, 
so  for  each  array,  it  may  be  beneficial  to  precompute  the  function  value  for  all  possible  inputs  (the  inputs 
consist  of  (processor,  dimension  number)  pairs)  and  store  the  results  in  a  table.  The  size  of  this  table  will 
be  equal  to  the  number  of  processors  in  the  system  times  the  number  of  dimensions  in  the  an-ay. 

In  addition  to  describing  the  implementation  of  the  getf  irst  function,  this  section  also  describes  how 
to  determine  the  block  size  and  stride  for  each  dimension  of  the  array,  which  are  the  n  p  and  sp  parameters, 
respectively,  used  in  equation  (3),  and  the  ns  and  parameters  used  in  equation  (4). 

If  the  system  contains  a  large  number  of  processors,  memory  and  time  constraints  may  make  it  impractical 
or  impossible  to  precompute  all  values  of  the  getf  irst  function.  In  this  case,  the  function  may  have  to 
perform  the  computation  every  time  it  is  called.  However,  for  the  remainder  of  this  section,  we  will  assume 
that  a  table  of  precomputed  results  is  being  built. 
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If  we  allow  dynamic  redistribution  of  arrays,  we  will  need  some  runtime  ownership  descriptor  for  each 
array  which  is  used  to  hold  the  parameters  required  by  equations  (3)  and  (4).  In  addition  to  the  table  of 
getf  irst  results,  the  runtime  descriptor  should  also  contain  the  block  size,  stride,  and  dimension  size  for 
each  array  dimension. 

We  assume  that  a  subset  of  the  Fortran  D  and  HPF  distribution  syntax  is  being  used.  That  is,  each  array 
is  ALIGNed  with  a  TEMPLATE,  which  is  DISTRIBUTEd.  When  the  distribution  type  is  specified  for  a 
template  dimension,  we  require  that  the  number  of  processors  over  which  to  distribute  the  dimension  is 
given.  For  example,  BLOCK(n)  orCYCLIC(n)  specifies  distributing  the  dimension  over  n  processors,  and 
BLOCK-CYCLIC( 6,  n)  specifies  distributing  the  dimension  over  n  processors  with  a  block  size  of  b.  We 
make  the  following  restrictions  on  Fortran  D’s  general  alignment  and  distribution  syntax: 

•  Arbitrary  alignment  functions  are  not  allowed.  Only  alignment  with  a  constant  offset  and  a  stride  of 
0  or  1  is  allowed.  For  example,  if  A  is  an  array  and  T  is  a  template,  A(  i )  could  be  aligned  with  T(  i ), 
T (i  +  1 ),  T( i  -  3),  orT(5),  but  not  T(3i)orT(4  -  i). 

The  purpose  of  this  restriction  is  to  force  array  ownership  sets  to  take  the  form  of  equations  (3)  and 
(4).  With  only  block-cyclic  distributions  allowed,  the  ownership  of  the  template  elements  can  always 
be  described  by  these  equations,  but  with  arbitrary  alignment  functions,  the  ownership  of  the  array 
elements  is  not  necessary  describable  by  these  equations.  With  these  restricted  alignment  functions, 
the  ownership  of  the  array  elements  can  always  be  described  by  a  block-cyclic  distribution. 

•  Every  dimension  of  a  template  must  be  distributed  as  BLOCK.  CYCLIC,  or  BLOCK-CYCLIC.  Note 
that  block  and  cyclic  are  special  cases  of  biock-cyclic.  Also  note  that  non-distribution  of  an  array 
or  template  dimension  is  allowed,  since  it  is  equivalent  to  distributing  over  one  processor  in  the 
dimension. 

In  Section  10.4,  we  describe  overlapped  array  ownership.  Specifying  such  an  overlap  can  be  done  with  a 
simple  extension  to  the  ALIGN  statement.  Ratherthanusingaconstructsuchas  “ALIGN  A(  / )  WITH 
we  might  use  “ALIGN  k(i  -  1  :  i  4-  1)  WITH  T(  i ),”  to  specify  that  any  processor  that  owns  template 
element  T(  i)  also  owns  array  elements  A(  i  -  1  :  i  +  1 ).  We  will  refer  to  the  left  overlap  and  riyht  overlap. 
which  in  this  case  are  -  1  and  1,  respectively. 

The  process  of  converting  an  ALIGN  statement  and  a  DISTRIBUTE  statement  into  an  ownership 
description  is  quite  complex.  Some  of  the  subtleties  involved  are  the  following: 

ALIGN  A(  i.j)  WITH  T(i).  In  this  case,  the  second  dimension  of  A  is  not  aligned  with  any  dimension  of 
T.  This  alignment  statement  is  equivalent  to  adding  a  dummy  dimension  to  T.  aligning  the  second 
dimension  of  A  with  the  dummy  dimension  of  T,  and  distributing  the  dummy  dimension  over  one 
processor. 

ALIGN  A(i)  WITH  T(f',3).  Here  is  an  example  of  alignment  with  a  constant  index.  If  the  second 
dimension  of  T  is  distributed,  there  may  be  some  processors  that  do  not  own  any  part  of  A. 

ALIGN  A  (i.j),  B(j.i)  WITH  T  (i.j).  Here  we  have  an  alignment  of  B  involving  index  permutation. 
When  computing  the  ownership  sets  of  A  and  B,  it  will  be  imponant  to  ensure  that  A(  i.j )  and  B(  j.  i ) 
are  owned  by  the  same  processor  for  all  legal  values  of  i  and 

When  we  convert  the  distribution  syntax  into  an  array  owners  iptor.  the  minimum  requirement 

is  the  following.  Suppose  we  have  two  arrays,  A  and  B,  a  template  l ,  and  the  statements. 
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ALIGN  A(i')  WITH  T(i) 
ALIGN  B(i)  WITH  T(t') 


We  must  ensure  that  A (i)  and  B(i)  are  owned  by  the  same  processor,  for  all  legal  values  of  i.  However, 
suppose  that  instead,  B  is  aligned  with  template  U,  which  is  declared  and  distributed  identically  to  T.  Then 
there  is  no  explicit  requirement  that  A (i)  and  B(i)  are  owned  by  the  same  processor. 

Our  approach,  however,  is  to  define  an  algorithm  which  will  consistently  produce  the  same  ownership 
descriptor  for  a  given  alignment,  distribution,  and  template  definition,  regardless  of  the  name  of  the  template. 
In  the  above  example,  A(i)  and  B(i)  will  be  owned  by  the  same  processor  regardless  of  whether  B  is  aligned 
with  T  or  U,  provided  T  and  U  are  declared  and  distributed  identically.  Our  algorithm  takes  as  input  an 
ALIGN  statement,  its  corresponding  DISTRIBUTE  statement,  the  TEMPLATE  definition,  and  a  processor 
number  q .  Each  processor  has  a  unique  value  of  q  between  0  and  TV  -  1 ,  inclusive,  where  N  is  the  number 
of  processors  in  the  system.  The  algorithm  returns  a  vector  /,  which  contains  the  index  corresponding  to 
the  first  element  owned  by  processor  q  in  each  corresponding  dimension  of  the  array.  The  algorithm  can 
also  return  a  “null  vector”,  which  signifies  that  the  processor  does  not  own  any  part  of  the  array.  We  will 
assume  that  the  compiler  has  already  performed  all  necessary  error  checking. 

The  first  step  in  the  algorithm  is  to  convert  the  TEMPLATE,  ALIGN,  and  DISTRIBUTE  statements  into 
a  normal  form  which  includes  no  index  permutation,  where  each  dimension  of  the  array  is  aligned  with  a 
dimension  of  the  template,  and  where  each  dimension  of  the  template  is  distributed.  This  conversion  is 
done  in  the  following  manner  (an  example  will  follow): 

1.  Determine  the  number  of  processors  assigned  to  each  dimension  of  the  template.  If  a  dimension  is 
specified  not  to  be  distributed  (i.e.,  a  appears  in  that  dimension  in  the  DISTRIBUTE  statement), 
change  its  distribution  to  be  BLOCK-CYCLIC(  n,  1 ),  for  any  desired  block  size  n.  The  actual  block 
size  chosen  is  irrelevant;  the  important  issue  is  that  the  dimension  is  distributed  over  only  one 
processor.  Let  vector  v  contain  the  quantities  of  processors  assigned  to  the  dimensions  of  the  template 
(i.e.,  there  are  vt  processors  assigned  to  dimension  i  of  the  template).  If  f[j  r,  >  .V,  where  V  is  the 
number  of  processors  available  to  execute  the  assignment  statement,  then  signal  an  error.  If  q  >  fjj  ,Tj. 
where  q  is  the  processor  number,  then  return  the  null  vector,  since  the  array  is  only  distributed  across 
the  first  UjVj  processors. 

2.  From  v  and  the  processor  number  q,  compute  the  vector  p,  by  setting  p,  =  [q/  fl^o  FjJ  mod  r,  for 
each  value  of  i.  For  any  two  distinct  values  of  q  less  than  fjj  the  corresponding  p  vectors  will 
differ  in  at  least  one  component. 

This  p  vector  just  assigns  each  processor  to  a  unique  nonnegative  integer  point  in  the  A- -dimensional 
space  bounded  by  vector  v,  where  k  is  the  length  of  vectors  p  and  v. 

3.  Consider  all  constant,  and  slice  indices  in  the  template  expression  of  the  ALIGN  statement.  If  the 
given  processor  does  not  own  that  part  of  the  template  (see  below),  return  the  null  vector.  Otherwise, 
remove  these  types  of  indices  from  the  template  expression  of  the  ALIGN  statement.  Also  remove 
the  corresponding  dimensions  from  the  TEMPLATE  statement,  the  DISTRIBUTE  statement,  r,  and 
P- 

The  processor  determines  whether  it  owns  this  part  of  the  template  by  considering  the  bounds  of  that 
dimension  of  the  template  and  the  block  size  n,-.  In  a  cyclic  distribution,  the  block  size  is  1 ,  and  in  a 
block  distribution,  the  block  size  is  [S./r,],  where  S,  is  the  number  of  elements  in  that  dimension, 
i,  of  the  template.  In  a  block-cyclic  distribution,  the  block  size  will  be  given.  Assume  that  in  this 
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dimension,  i,  of  the  template,  indices  range  from  c,  to  dp,  in  Fortran,  c,  is  usually  1,  and  in  C,  c,  is  0. 
There  are  three  possibilities  to  consider: 


•  If  a  constant  is  specified  (e.g.,  ALIGN  A(i)  WITH  T(3)),  and  {k,  -  c,  -  p.n,)  mod  n,v,  <  n,, 
where  k,  is  the  constant  index  (A:,  =  3  in  the  example),  then  the  processor  owns  the  template 
element. 

•  If  a  is  specified  (e.g.,  ALIGN  A(i)  WITH  T(*)),  and  q  <  vj,  then  the  processor  owns 
the  section  of  the  template. 

•  If  a  slice  is  specified  (e.g.,  ALIGN  A(z)  WITH  T(1  :  10  :  2)),  and  the  intersection  of  that  slice 
an<*U?=oI(ci  +  Kni  +  J  '■  di  :  n.  ui)  is  nonempty,  then  the  processor  owns  the  template  section. 

4.  If  there  is  a  symbolic  index  (such  as  i  or  j)  in  the  array  expression  of  the  ALIGN  statement  that  does 
not  appear  in  the  template  expression,  change  that  index  to  a  in  the  array.  The  goal  of  this  step 
is  to  identify  unaligned  dimensions  of  the  array.  For  example,  ALIGN  A (i,j)  WITH  T(i)  would 
change  to  ALIGN  A (*»  WITH  T (i). 

5.  For  each  in  the  array  expression  of  the  ALIGN  statement,  replace  it  with  a  new  dummy  variable. 
Add  a  corresponding  1  to  the  end  of  v  and  a  0  to  the  end  of  p.  Modify  the  DISTRIBUTE  statement 
so  that  each  of  these  dummy  dimensions  is  distributed  BLOCK-CYCLIC(  n.  1 )  for  any  desired  value 
of  n.  Add  a  dimension  to  the  template  with  the  same  index  bounds  as  the  corresponding  dimension 
of  the  array. 

6.  Permute  the  indices  in  the  template  expression  of  the  ALIGN  statement  so  that  they  match  up  with 
the  indices  in  the  array  expression.  Apply  the  same  permutation  to  v,  p,  the  DISTRIBUTE  statement, 
and  the  dimension  sizes  of  the  template.  For  example,  ALIGN  A(  i.j)  WITH  T(  j  +  1 .  i  -  2)  would 
change  to  ALIGN  A (i,j)  WITH  T( i  -  2,  j +  1 ),  and  the  same  permutation  would  be  applied  to  r, 
p,  the  DISTRIBUTE  statement,  and  the  dimension  sizes  of  the  template. 

Vector  v  now  specifies  the  number  of  processors  assigned  to  each  corresponding  dimension  of  the  array, 
and  vector  p  specifies  the  processor’s  order  in  each  dimension  of  the  array. 

As  an  example  of  the  steps  performed  so  far,  suppose  we  have  the  following  statements,  and  perform 
the  above  steps: 


ALIGN  A(i,j-  1  :  j+2,k,l)  WITH  T(j,3. i  +  2) 
DISTRIBUTE  T( BLOCK(  2 ) ,CYCLIC(  3 ) ,BLOCK( 4 ) ,CYCLIC(  5 ) ) 


1.  The  DISTRIBUTE  statement  yields  F  =  (2. 3,4.5).  Note  that  we  must  have  at  least  2x  3x4x5  =  120 
processors  executing  this  code  section. 

2.  Suppose  that  our  processor  is  number  1 19.  Then  p  =  ( 1 . 2. 3.4). 

3.  There  is  one  constant  index  in  the  template  expression  of  the  ALIGN  statement:  the  3.  Assume  that 
this  processor  owns  that  part  of  the  template,  so  we  continue.  Removing  that  dimension  everywhere 
gives  us: 


ALIGN  A  (i.j  —  1  :  j  +  2.k.l)  WITH  T  {j.k,i  +  2) 
DISTRIBUTE  T( BLOCK( 2 ),BLOCK( 4).CYCLIC( 5 ) ) 
v  =  (2,4,5),  p  =  (1.3.4) 
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4.  The  l  appears  as  an  index  of  the  array  expression  in  the  ALIGN  statement,  but  it  does  not  appear  in 
the  template  expression,  so  we  change  it  to  a  t: 

ALIGN  A(i,j  -  1  :j  +  2 ,k,*)  WITH  T (j,k,i  +  2) 

DISTRIBUTE  T(BLOCK(2),BLOCK(4),CYCLIC(5)) 
v  =  (2,4,5),  p  =  (1,3,4) _ 

5.  There  is  one  *  in  the  array  expression  of  the  ALIGN  statement;  replace  it  by  a  dummy  d,  resulting  in: 

ALIGN  k(i,j  -\:j  +  2,k,d)  WITH  T  (j,k,i  +  2  ,d) 

DISTRIBUTE  T(BLOCK(2),BLOCK(4),CYCLIC(5),BLOCK-CYCLIC(  1,1)) 
v  =  (2,4,5, 1),  p  =  (1,3, 4,0) _ 

6.  Applying  the  permutation  gives  us: 

ALIGN  A  (ij  -  1  :  j  +  2,k,d)  WITH  T  (i  +  2  ,j,k,d) 

DISTRIBUTE  T(CYCLIC(5),BLOCK(2),BLOCK(4),BLOCK-CYCLIC(  1,  1)) 
v  =  (5,2,4, 1),  p=  (4, 1,3,0) _ 

At  this  point,  we  have  converted  the  specifications  to  a  normal  form  as  described  previously.  We  can 
now  use  the  information  given  by  the  index  bounds  and  the  alignment  offsets  to  generate  the  result  vector. 
Suppose  that,  after  applying  the  above  steps,  we  have  the  following  declarations: 

DIMENSION  k(a\  : z\,a2  :  S2,...) 

TEMPLATE  T(c|  :  di,c2  :  d2,. . .) 

ALIGN  A(i(  +  o[  :  i  |  +  of,  i2  +  ok  :  h  +  ok - )  WITH  T(ij  +  b\.  ii  +  63 - ) 

DISTRIBUTE  T(...) _ 

At  this  point,  we  can  easily  determine  which  elements  of  the  template  that  processor  <7  owns,  but  these 
template  elements  must  be  translated  into  elements  of  the  array.  In  dimension  j  of  the  template,  the 
processor  owns  the  elements  of  the  template  corresponding  to  the  indices: 

n'r' 

[jUj+i-dj:  sj). 

1=0 

The  parameter  n'  is  the  block  size  in  dimension  j  of  the  template,  which  can  be  found  in  the  DISTRIBUTE 
statement;  /j  =  c:  +  PjTi'y,  and  Sj  =  n'  vj. 

When  we  apply  the  alignment  function,  we  find  that  the  processor  owns  the  elements  of  the  array 
corresponding  to  the  indices: 

n,-1 

IJ  +  ‘  :  Hj  ■  *j) 

1=0 

fj  =  Cj  -  hj  +°J  +  i>j”r  (,7) 

where  nj  -  n';  4 -  of  -  of .  This  f  vector  is  returned. 
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As  mentioned  previously,  we  need  a  separate  function  to  determine  the  block  size  and  stride  for  a 
dimension  k  of  an  array,  given  the  TEMPLATE,  ALIGN,  and  DISTRIBUTE  statements.  The  first  step 
of  this  algorithm  is  to  determine  which  dimension  of  the  template  with  which  dimension  k  of  the  array 
is  aligned  by  considering  the  ALIGN  statement.  If  the  array  dimension  is  not  aligned  with  any  template 
dimension,  then  that  array  dimension  is  not  distributed,  and  thus  each  processor  owns  every  element  in  that 
dimension.  In  this  case,  we  can  return  any  values  for  the  stride  and  block  size,  as  long  as  they  are  equal. 

If  the  array  dimension  is  aligned  with  a  template  dimension,  consider  the  distribution  of  that  template 
dimension  in  the  DISTRIBUTE  statement.  The  DISTRIBUTE  statement  gives  the  number  of  processors 
v  over  which  the  dimension  is  distributed,  and  it  gives  some  indication  of  the  template  block  size.  If  it 
is  a  block-cyclic  distribution,  the  template  block  size  is  given  explicitly.  If  it  is  a  cyclic  distribution,  the 
template  block  size  is  implicitly  1 .  If  it  is  a  block  distribution,  the  template  block  size  is  [S/u] ,  where  S  is 
the  number  of  elements  in  that  dimension  of  the  template.  The  true  block  size  is  the  sum  of  the  template 
block  size  and  the  total  overlap  in  the  corresponding  array  dimension.  The  stride  is  simply  the  product  of 
the  template  block  size  and  v.  We  return  the  stride  and  the  block  size. 

12  Optimizations 

The  algorithms  Local-Index-Send,  Local-Index-Receive,  and  Compute-Loop  are  designed  to  work 
in  the  worst  case,  when  none  of  the  parameters  is  known  at  compile  time.  However,  it  is  almost  never 
the  case  that  all  parameters  are  unknown  at  compile  time.  For  example,  quite  frequently  the  strides  in  the 
assignment  statement  will  be  1;  and  if  they  are  not  1,  they  are  still  likely  to  be  compile-time  constants. 
Knowing  both  strides  at  compile  time  can  provide  some  of  the  best  optimization  potential,  because  Euclid's 
GCD  algorithm  can  be  executed  at  compile  time  rather  than  at  runtime. 

When  parameters  are  known  at  compile  time,  potential  optimizations  can  range  from  simple  constant 
folding  to  tightening  loop  bounds  or  eliminating  certain  loops  altogether. 

12.1  Reducing  potential  communication 

Recall  that  in  the  general  algorithm  presented  in  Section  9,  the  outer  loops  in  the  send  and  receive  phases 
have  each  processor  considering  all  other  processors  for  communication.  However,  one  of  the  user’s  goals 
(or  the  goal  of  a  separate  compiler  phase)  should  be  to  carefully  align  and  distribute  the  arrays  to  minimize 
communication.  When  the  compiler  detects  “good”  alignments  and  distributions  at  compile  time,  it  can 
achieve  further  runtime  efficiency  by  generating  code  that  restricts  the  set  of  processors  to  consider  at 
runtime  for  communication. 

If  the  alignment  and  distribution  are  specified  automatically  by  a  separate  compiler  phase,  that  phase 
might  already  have  determined  what  communication  must  occur  for  a  given  array  statement.  Otherwise,  we 
can  look  for  the  common  communication  patterns  described  below. 

It  is  important  to  note  below  that  these  tests  must  be  applied  separately  to  each  dimension  of  each  array. 
If,  for  example,  we  determine  that  communication  occurs  with  at  most  two  processors  in  each  of  the  two 
array  dimensions,  then  overall  communication  will  be  with  at  most  four  processors.  We  can  only  be  certain 
that  communication  occurs  with  at  most  one  processor  if  we  can  determine  independently  for  each  array 
dimension  that  communication  occurs  with  at  most  one  processor. 

12.1.1  Communication  with  a  small  number  of  processors 

In  some  cases,  regardless  of  the  upper  and  lower  bounds  of  the  slices,  we  can  determine  at  compile  time 
that  each  processor  sends  to  at  most  k  other  processors,  and  correspondingly  receives  from  at  most  k  other 
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processors,  where  k  is  some  small  integer.  This  will  happen  when  each  array  is  distributed  over  the  same 
number  of  processors,  and  the  ratio  of  the  strides  in  the  assignment  statement  is  equal  to  the  ratio  of  the 
strides  of  the  distributions.  As  an  example  of  the  magnitude  of  k,  when  these  conditions  are  met,  and  the 
alignment  function  contains  no  overlap,  the  value  of  k  is  at  most  2. 

More  formally,  we  can  determine  that  we  send  to  at  most  two  processors  and  receive  from  at  most  two 
processors  when: 


saSs  =  sbsv  and  vA  =  vb ,  (18) 

where  va  and  vb  are  the  numbers  of  processors  assigned  to  the  respective  array  dimensions. 

For  this  analysis,  we  introduce  the  concept  of  a  virtual  processor ,  for  the  following  reasons.  Suppose 
we  have  an  infinite  number  of  virtual  processors,  and  block-cyclic  distributions.  Then  we  can  assign  each 
virtual  processor  one  contiguous  block  of  array  elements.  If  sAss  =  *b*t><  then  each  virtual  processor 
will  send  to  one  or  two  virtual  processors  which  are  a  fixed  (possibly  negative)  virtual  communication 
distance  to  the  right,  and  will  correspondingly  receive  from  a  fixed  virtual  distance  to  the  left.  Due  to 
the  nature  of  the  block-cyclic  distribution,  a  virtual  processor  is  mapped  to  a  real  processor  by  taking  the 
virtual  processor  number  modulo  the  number  of  real  processors.  When  the  virtual  communication  distance 
is  fixed,  it  translates  to  a  fixed  real  communication  distance  only  when  the  corresponding  array  dimensions 
are  distributed  over  the  same  number  of  real  processors  (i.e.,  vA  =  vb  )■ 

Let  ca,  bA,  and  oLA [  be  the  c,  r,  and  oL  parameters  from  equation  (17)  corresponding  to  the  left-hand  side 
array  of  the  assignment  statement,  and  let  cb,  bB,  and  Og  be  the  similar  parameters  for  the  right-hand  side 
array.  We  want  to  find  p\  -  Pb,  where  pA  is  the  virtual  processor  that  owns  A(  f  A  +  ksA  )  and  pn  is  the 
virtual  processor  that  owns  B( Zb  +  ksB),  for  any  value  of  k.  From  equation  ( 1 7),  we  know  that  a  particular 
processor  pA  owns  the  elements  corresponding  to  indices: 

cA-bA  +  Pasv/va  +  oA  :  cA  -  bA  +  Pa»v/va  +  oLA  +  nv  -  1 


and  that  a  particular  processor  pb  owns  the  elements  corresponding  to  indices: 

CB  ~  bB  +  Pbss/vb  +  Og  :  cB  -  bB  +  ps*s/ vB  +  Og  +  ns  -  I . 


Given  a  particular  value  of  k,  we  would  like  to  find  bounds  on  pA  and  pB,  where  pA  is  the  processor  that 
owns  array  element  fA  +  ksA  and  pB  is  the  processor  that  owns  array  element  fB  +  ksB-  We  know  that: 

ca  ~  bA  +  Pasv/va  +  oA  <  fA  -I-  ksA  <  cA  -  bA  +  pA*v/cA  +  oLA  +  np  -  1 
cB  ~  bB  +  Pbss/vb  +  °b  <  Zb  +  ksB  <  cB  -  bB  +  pB»s/ "B  +  <>B  +  -  1  • 

These  equations  yield  the  following  relations: 

fA  +  ksA  -  c, 4  +  bA  -  ola  -  nv  +  1  fA  +  ks A  -  cA  +  bA  -  »A 

sp/vA  -  -  -sp/'M 


/b  +  ksB  cb  +  bB  —  Og  —  f?5  +  1  „  ^  /b  +  ksg  -  CB  +  bB  - 

- - -  <  PB  <  - ; - CL 

*s/vB 

These  equations  allow  us  to  find  an  upper  bound  on  pA  -  pB,  which  is  less  than  or  equal  to  [p,\  -  pp\ 
since  pA  and  pB  are  integers: 


PA  -  PB  < 


f  A  +  k.SA  -  C,4  +  6,4  -  QLa  (  Zb  +  k. Sg  -  CB  +  l)B  -  Off  -  "5  +  1 


■W/»A 


*s/l'B 


Va 


/, i  -  c a  +  bA  -I-  ol\  _  Zb  ~  CB  +  bB  -  o'g  -  ns  +  1 

•Sp  N<; 
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Using  similar  analysis,  we  find  that  a  lower  bound  on  pa  -  pB  is: 


PA~  PB  > 


VA 


(  /a  ~  CA  +  bA  +  oLa  -  np  +  1 
\  Sp 


Zb  -  cb  +  bB  -  oB\ 
ss  ) 


Let  D\  be  the  lower  bound  on  pa  -  pb,  and  let  Di  be  the  upper  bound  on  pa  -  Pb-  Any  processor  p 
will  only  send  to  virtual  processors  p  +  D i  through  p  +  D2,  inclusive.  Similarly,  any  processor  q  will  only 
receive  from  virtual  processors  q  —  through  q  -  D\,  inclusive.  Virtual  processors  are  transformed  into 
real  processors  by  taking  the  virtual  processor  modulo  the  number  of  real  processors  in  that  array  dimension. 


12.1.2  Small  number  of  senders  or  receivers 

We  can  determine  that  a  small  number  of  the  processors  will  perform  a  send  if  one  of  the  following  conditions 
holds: 

•  n's  =  8s-  This  property  just  means  that  the  array  dimension  is  distributed  across  only  one  processor. 
This  information  is  not  particularly  useful. 

•  $b  is  a  multiple  of  ss .  This  property  means  that  every  array  element  accessed  on  the  right-hand  side 
is  guaranteed  to  fall  on  one  of  a  small  set  of  processors  (the  size  of  the  set  is  1  if  o£  =  o§),  regardless 
of  how  many  elements  are  accessed.  Bounds  on  the  sending  virtual  processor,  ps ,  can  be  determined 
by  solving  the  inequality: 

cs  -bs  +  ps^s/vs  +  <>s  <  Zb  <  cs  -  bs  +  ps$s/vs  +  o$  +  ns  -  1 


which  yields: 


bs  -  cs  -  05  -  ns  +  1  +  Zb 

bs  -  cs  -  05  +  Zb 

Ss/vs 

2:  PS  S 

ss/vs 

(19) 


•  [(Zb  -  cs  +  bs)/ns J  =  [(Ib  -  cs  +  bs)/ns\  and  o|  =  o§.  This  property  means  that  Zb  and  lB 
fall  within  the  same  block  and  are  thus  guaranteed  to  reside  on  the  same  processor,  as  are  all  elements 
that  fall  between  Zb  and  lB.  The  virtual  processor  ps  is  given  by  equation  (19). 

Similarly,  we  can  determine  that  only  one  of  the  processors  will  perform  a  receive  if  one  of  the  following 
conditions  holds: 

•  np  =  sp.  This  property  just  means  that  the  array  dimension  is  distributed  across  only  one  processor. 
This  information  is  not  particularly  useful. 

•  « 1  is  a  multiple  of  sp.  As  above,  this  property  means  that  every  array  element  accessed  on  the 
left-hand  side  is  guaranteed  to  fall  on  one  of  a  small  set  of  processors  (the  size  of  the  set  is  1  if 
Op  =  op),  regardless  of  how  many  elements  are  accessed.  The  single  receiving  virtual  processor,  pp , 
can  be  determined  by  solving  the  inequality: 

cp  —  bp  +  ppsp/vp  +  Op  <  Za  <  cp  -  bp  +  ppsp/vp  +  Op  +  np  —  1 

which  yields: 

bp  -  cp  -  Op  -  np  +  1  -f  Za 

sp/vp 


<Pv< 


bp  -  cp  -  Op  +  Za 

sp/rp 


(20) 
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•  [.if a  ~  cv  +  bp)/np J  =  [{l a  -  cp  +  bp) / np]  and  o£  =  Op.  This  property  means  that  fA  and  lA 
fall  within  the  same  block  and  are  thus  guaranteed  to  reside  on  the  same  processor,  as  are  all  elements 
that  fall  between  fA  and  lA.  The  virtual  processor  pp  is  given  by  equation  (20). 

The  principal  benefit  of  this  single  sender/receiver  test  is  that  if  it  applies,  we  can  wrap  the  send  phase  or 
the  receive  phase  inside  a  conditional  which  tests  the  processor  number.  In  this  way,  the  entire  send  phase 
or  receive  phase  can  be  eliminated  from  consideration  for  some  processors. 

12.2  Eliminating  loops 

If  the  send,  receive,  and  computation  loops  are  implemented  exactly  as  described  in  Figures  9,  11 ,  and  1 2,  it 
is  possible  that  the  costs  of  computing  the  loop  bounds  will  dominate  the  execution  time  of  the  program.  One 
obvious  solution  to  this  problem  is  for  the  compiler  to  perform  a  kind  of  constant  folding  by  precomputing 
such  parameters  as  strides  and  the  results  of  Euclid’s  algorithm,  at  compile  time.  In  general,  it  will  be 
difficult  to  precompute  the  loop  bounds,  since  they  depend  on  fp  and  /$,  which  can  have  different  values 
for  different  processors. 

At  compile  time,  we  are  likely  to  know  whether  the  distribution  of  any  given  array  is  block,  cyclic, 
or  block-cyclic.  If  either  the  left-hand  side  array  or  the  right-hand  side  array  is  block  or  cyclic,  then  we 
need  only  one  outer  loop  in  the  Local-Index-Send  and  Local-Index-Receive  loops,  rather  than  two. 
If  neither  array  is  block-cyclic,  we  can  eliminate  the  i  and  j  loops  altogether.  If  the  left-hand  side  array  is 
not  block-cyclic,  we  can  eliminate  the  j  loop  in  the  COMPUTE-LOOP  algorithm. 

These  outer  loops  are  due  entirely  to  the  block-cyclic  distribution.  A  block-cyclic  set  is  the  union  of 
disjoint  slices,  and  characterizing  the  union  requires  a  loop.  In  the  Local-Index-Send  and  Local-Index- 
RECEIVE  algorithms,  we  need  two  extra  loops,  since  equation  (6)  involves  two  block-cyclic  distributions  in 
the  intersection. 

Both  block  and  cyclic  distributions  can  be  specified  as  a  single  slice.  When  /  is  the  index  corresponding 
to  the  first  array  element  that  a  processor  owns,  /  is  the  largest  array  index,  and  the  dimension  is  distributed 
across  v  processors,  a  cyclic  distribution  can  be  described  as: 

and  a  block  distribution  can  be  described  as: 

where  5  is  the  number  of  elements  in  the  template  dimension  that  the  array  dimension  is  aligned  with. 

This  observation  gives  more  concise  ways  of  defining  Ownp(k)  and  Owns( B)  than  in  equations  (3) 
and  (4).  With  these  new  definitions,  we  can  use  the  methods  presented  in  Section  6  to  derive  new  Local- 
Index-Send,  Local-Index-Receive,  and  Compute-Loop  algorithms.  Notice  that  there  are  eight  new 
sets  of  communication  algorithms  to  derive,  since  each  of  the  two  arrays  in  the  assignment  statement  can 
have  one  of  three  types  of  distributions,  and  that  there  are  two  new  sets  of  computation  algorithms  to  derive. 
The  new  LOCAL-INDEX-SEND  and  LOCAL-INDEX-RECEIVE  algorithms  are  given  in  Appendix  B,  and  the 
new  Compute-Loop  algorithms  are  given  in  Appendix  C. 

12  J  Customizing  the  output 

If  we  are  compiling  for  a  MIMD  architecture,  further  optimization  can  be  achieved  by  producing  customized 
output  for  each  processor.  If  we  know  at  compile  time  that  only  a  certain  subset  of  processors  will 
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be  executing  certain  operations,  we  normally  output  code  that  conditionally  executes  depending  on  the 
processor  ID.  But  if  we  create  customized  code  for  each  processor,  the  conditionals  can  be  evaluated  at 
compile  time,  not  at  runtime,  and  some  loop  bounds  can  be  precomputed.  This  approach  could  further  speed 
up  execution. 

Note,  however,  that  this  approach  does  not  scale  well  as  the  number  of  processors  in  the  system  increases. 
It  could  take  quite  long  just  to  compile  if  the  number  of  processors  is  large.  Thus  we  probably  will  not  want 
to  take  this  approach  unless  we  truly  want  the  utmost  execution  speed. 

On  the  other  hand,  it  may  be  the  case  that  certain  subsets  of  processors  usually  perform  similar  operations. 
In  this  case,  we  might  want  to  partition  the  processors  and  generate  one  customized  output  file  for  each 
subset  of  the  partition. 

12.4  Merging  receive  and  computation  phases 

Recall  that  the  overall  algorithm  temporarily  rea  ibutes  the  right-hand  side  arrays  of  the  assignment 
statement  to  match  the  distribution  of  the  hand  side  array,  and  then  performs  the  computation  phase. 
We  can  improve  upon  this  method  if  there  <nly  one  array  slice  on  the  right-hand  side,  and  the  memory 
accessed  by  the  two  array  slices  is  disjoint  (i.e.,  there  are  no  data  dependences).  In  this  case,  we  can 
perform  all  communication  of  single  array  elements  on  the  right-hand  side  (see  Section  10.1),  and  perform 
the  communication  for  the  right-hand  side  slice  last.  However,  rather  than  receiving  the  slice  elements  and 
storing  them  in  a  temporary  array,  we  can  receive  each  element,  perform  the  computation,  and  directly  store 
the  result  into  the  left-hand  side  array,  thus  allowing  us  to  merge  the  receive  and  computation  phases. 

To  illustrate,  the  sequential  Fortran  code  given  in  Section  2  would  conceptually  change  to  the  following 
after  applying  this  optimization: 

>-B  =  /fl 

DO  ij\  —  /ai^Ai^A 

A(m)  =  JF(B(is)) 

=  IB  +  SB 
END  DO 

13  Further  applications  of  array  statements 

We  have  derived  a  general  method  for  compiling  and  optimizing  general  array  assignment  statements. 
However,  the  programmer  will  often  need  to  have  a  larger  set  of  array  operations  at  his  disposal.  In 
this  section  we  describe  simple  extensions  of  array  assignment  statements  that  allow  more  powerful  array 
constructs. 

13.1  Redistributing  arrays 

From  time  to  time,  the  user  (or  a  separate  compiler  phase)  may  wish  to  change  the  distribution  of  an  array. 
For  example,  when  operating  on  an  array,  one  algorithm  might  be  more  efficient  when  only  the  rows  of  the 
array  are  distributed,  while  a  subsequent  algorithm  might  demand  that  the  array  be  distributed  by  columns. 

Such  a  redistribution  can  be  simulated  by  an  array  assignment  statement.  Suppose  we  wish  to  redistribute 
array  A.  We  can  create  a  declaration  for  a  new  array  A',  with  the  same  type  and  dimension  bounds  as  A. 
but  with  the  desired  new  alignment  and  distribution.  The  assignment  statement  A'  =  A.  when  compiled  in 
the  above  manner,  will  generate  the  necessary  communication  and  copying.  To  use  the  redistributed  array. 
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subsequent  references  to  A  should  be  replaced  by  references  to  A'.  Provided  that  the  memory  space  of  A  is 
disjoint  with  the  memory  space  of  A',  the  optimization  described  in  Section  12.4  always  applies. 

Notice  that  the  redistribution  method  described  here  requires  that  distributions  be  statically  known  at 
compile  time.  If  distributions  are  not  fully  known  at  compile,  the  compiler  must  use  only  the  runtime  array 
ownership  descriptors  to  determine  communication.  In  this  case,  most  of  the  communication  optimizations 
described  above  can  no  longer  apply. 

13.2  WHERE  statement 

The  WHERE  statement  is  a  valuable  construct  from  Fortran  90.  It  allows  conditional  data-parallel  execution 
of  array  statements.  The  general  form  is  (see  the  Fortran  90  reference  [1]  for  a  more  precise  definition): 

WHERE  mask-expr 
assn-stmt-l 
assn-stmt-2 
ELSEWHERE 

assn-stmt-3 

assn-stmt-4 

ENDWHERE 

The  sizes  and  shapes  of  the  sub-arrays  defined  by  the  mask  expression  and  the  left-hand  sides  of  the 
assignment  statements  must  be  the  same,  and  the  assignment  statements  must  all  be  array  assignment 
statements.  We  first  create  a  boolean  mask-array  which  has  the  same  distribution  as  the  left-hand  side 
array.  The  array  is  initialized  with  the  mask-expr,  and  indexed  with  the  same  array  slices  as  the  left-hand 
side  of  assn-stmt-l.  Then  assn-stmt-l  is  evaluated,  subject  to  the  true  values  in  the  mask-array.  Next,  the 
mask-array  is  redistributed  (see  above)  to  match  assn-stmt-2,  and  the  process  is  repeated  for  the  rest  of 
the  assignment  statements  inside  the  WHERE  section.  Within  the  ELSEWHERE  section,  the  same  method  is 
used,  except  that  we  use  negated  values  from  the  boolean  mask-array. 

As  an  example,  consider  the  following  construct: 

WHERE  A (fA  :lA:sA)^  B (fB  :  lB  :  sB) 

C(fc'-h-sc)  =  D(/d  :  Id  ■  sq) 

E(/e  :  Ie  ■  se)  =  F(/f  :  If  ■  sf) 

ELSEWHERE 

G( fa  ■  fa  ■  sc)  =  H {/h  '■  Ih  '■  sh) 

l(fi  :  1 1  :  .$i)  =  J (fj  :  lj  :  sj ) 

ENDWHERE 

The  transformations  described  above  would  result  in  the  following  code: 

Ml(  fc  ■  lc  '■  sc)  =  (A (fA  :  lA  :  *A)  ±  B(/a  :  lB  ■  *b)) 

C(/c  :  lc  ■  sc)  =  0(/o  :  Id  ■  sd )  [subject  toMlJ 
M2  (/e  ■  Ie  ■  se)  =  Ml  (fc  ■  lc  ’■  sc) 

E( /e  '■  l e  '■  se  )  =  F(/f  :  If  ■  sf)  [subject  to  M2] 

M3  (fa  ■  la'  sc)  =  M2  [Je  :  Ie  ■  se) 

G  (fc  ■  lc  :  sc;)  =  H  (/«:///:  .s  H)  [subject  to  negation  of  M3] 

M4 (//  :  //  :-*»/)  =  M3(/c  :  lG  :  V; ) 

!(//  :  I/  :  S[)  =  J  (fj  :  lj  :  .s  j)  [subject  to  negation  of  M4] 
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133  Reduction  operators 

An  array  reduction  statement  takes  the  form: 

x  =  Reduce{k(fA  :lA:sA)), 

where  Reduce  is  a  reduction  operator.  A  reduction  operator  is  an  associative  binary  operator  that  reduces 
an  array  or  array  section  to  a  single  scalar  value.  Typical  reduction  operators  are  minimum  value,  maximum 
value,  location  of  maximum  value,1  sum,  and  product.  Note  that  these  examples  of  reduction  operators  are 
also  commutative.  The  remainder  of  this  section  will  assume  that  the  reduction  operator  is  commutative. 

The  reduction  can  be  evaluated  efficiently  by  having  each  processor  perform  a  local  reduction  on  the 
specified  portion  of  the  array  that  it  owns.  This  local  reduction  yields  one  value  per  processor,  and  all 
processors  can  then  combine  their  results  to  yield  the  global  reduction.  This  final  result  can  be  broadcast  to 
all  processors,  if  necessary. 

To  perform  the  local  reduction,  processor  V  needs  to  perform  its  local  reduction  on  the  elements  of  A 
given  by  indices: 

Ownv(k)  f!  (fA  :  lA  :  sA). 

The  calculation  of  this  set  is  identical  to  that  in  Section  8,  and  the  Compute-Loop  algorithm  can  be  used 
almost  unchanged  to  calculate  the  local  reduction. 

13.4  Parallel  independent  loop  iterations 

Our  experience  has  shown  us  that  the  array  constructs  described  so  far  are  simply  not  powerful  enough 
in  general  to  produce  acceptable  performance  for  many  real  programs.  For  example,  consider  the  two- 
dimensional  Fast  Fourier  Transform  (2D  FFT)  algorithm.  This  algorithm  first  performs  a  1 D  FFT  operation 
independently  on  each  row  of  an  n  x  n  array,  and  then  performs  a  ID  FFT  operation  independently  on  each 
column.  The  ID  FFT  operation  is  difficult  to  parallelize,  and  array  assignment  statements  are  simply  not 
powerful  enough  to  effectively  parallelize  it. 

An  obvious  source  of  parallelism  for  the  2D  FFT  algorithm  is  in  the  outer  loop.  Since  there  are  no 
loop-carried  dependences,  each  iteration,  which  consists  of  a  ID  FFT  operation  on  a  row  or  a  column,  can 
proceed  in  parallel.  Furthermore,  if  each  processor  owns  only  entire  rows  (i.e.,  the  array  is  distributed  only 
in  the  first  dimension),  no  communication  need  occur  during  the  row-wise  ID  FFT  loop.  Similarly,  no 
communication  occurs  within  the  column-wise  ID  FFT  loop  when  the  array  is  distributed  only  in  the  second 
dimension.  These  observations  motivate  the  definition  of  a  basic  parallel  loop  construct. 

This  parallel  loop  can  take  the  following  form: 

PDO  i  =  /,  /,  s 
INPUT  k(i, :) 

OUTPUT  B(:,i) 

. . .  statements  reading  k(  i, :)  and  writing  B( :,  i) 

END  PDO 


The  INPUT  section  lists  the  distributed  array  sections  that  are  read  in  each  loop  iteration,  and  the 
OUTPUT  section  lists  the  distributed  array  sections  that  are  written.  The  PDO  loop  index  may  only  appear 
in  one  dimension  of  these  array  sections.  To  compile  this  loop,  we  first  choose  an  arbitrary  one-dimensional 

‘The  location  of  minimum/maximum  value  reduction  operator  actually  returns  a  vector,  rather  than  a  single  scalar  value,  when 
the  reduced  array  expression  is  multidimensional. 
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template  and  distribution,  and  we  redistribute  the  arrays  in  the  input  and  output  sections  so  that  the  dimension 
containing  the  loop  index  is  now  aligned  with  the  new  template,  using  the  method  of  redistribution  described 
above.  This  redistribution  ensures  that  all  data  read  or  written  during  any  one  iteration  of  the  loop  is  owned 
by  one  processor,  and  thus  no  intra-iteration  communication  is  necessary.  Each  processor  independently 
iterates  through  the  indices  of  the  template  that  it  owns  and  that  are  in  the  slice  (f  :  l :  s).  This  intersection 
can  be  found  using  equations  (7)  and  (8).  At  the  completion  of  all  the  loop  iterations,  the  arrays  are 
redistributed  to  their  original  forms. 

As  an  optimization,  the  compiler  can  try  to  choose  a  new  template  and  distribution  that  matches  as  many 
arrays  in  the  input  and  output  lists  as  possible.  When  the  distribution  and  template  match  that  of  an  array  in 
the  input  list,  it  is  not  necessary  to  redistribute  that  array. 

14  Conclusion 

As  High  Performance  Fortran  (HPF)  comes  of  age,  compiler  writers  who  wish  to  implement  HPF  for  a 
private  memory  multiprocessor  will  first  need  to  implement  communication  and  memory  management  for 
distributed  arrays.  We  have  reduced  this  problem  to  finding  intersections  of  index  sets  characterized  by 
the  three-parameter  array  slice  notation,  (f  :  l  :  s).  By  treating  block-cyclic  distribution  sets  as  unions  of 
disjoint  slices,  we  have  derived  an  efficient  means  of  calculating  the  communication  between  processors. 
This  calculation  includes  the  mapping  of  global  indices  into  local  memory. 

In  addition  to  calculating  the  communication,  we  have  provided  a  means  of  converting  TEMPLATE, 
ALIGN,  and  DISTRIBUTE  statements  into  array  ownership  descriptors  which  are  needed  at  runtime. 
Finally,  we  have  derived  communication  optimizations  for  many  common  cases.  All  derivations  are 
presented  in  detail  to  allow  the  compiler  writer  to  gain  a  better  understanding  of  the  final  results. 

This  work  has  been  validated  by  the  implementation  of  the  Fx  compiler  on  the  iWarp  system,  a  private 
memory  multiprocessor,  and  the  results,  including  optimizations  for  common  cases,  have  been  promising. 
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A  Extended  algorithms  for  sigr  ed  strides 

In  the  derivations  of  the  Global-Index-Send  and  LOCAL-INDEX-SEND  algorithms  presented  in  Section  6, 
the  GLOBAL-INDEX-RECEIVE  and  LOCAL-Index-Receive  algorithms  in  Section  7,  and  the  COMPUTE- 
Loop  algorithm  in  Section  8,  we  assumed  that  the  strides  .  ..  Jid  sB  specified  in  the  assignment  statements 
were  positive.  In  Section  10.3,  we  described  at  a  high  level  how  to  modify  the  algorithms  when  one  or  both 
strides  are  negative.  This  appendix  contains  the  algorithms  which  account  for  the  signs  of  the  strides  in  the 
assignment  statement. 

A.l  Extended  Local-Index-Send  algorithm 

In  this  algorithm,  note  that  there  are  four  main  parts.  In  each  of  the  last  three  parts,  lines  that  contain  a 
difference  between  the  corresponding  line  in  the  first  part  are  preceded  by  an  asterisk  (*). 

IF  sA  >  0  AND  SB  >  0  THEN 
(*l,5i,5i)  =  euclid(s4,sx>) 

(*2,2/2,52)  =  euclid(sB5p/5i,S5) 

stride  =  sBsv»s /{gxgz) 

lmstride  =  sBspns/(g\g2 ) 

last  =  min(/B  +  sb  [(min(/.4,/p)  -  /a)/sa\.Is) 

lmlast  =  [(last  -  fs)/$s\ns  +  min((last  -  fs)  mod.s5.n5  -  1)  +  K 
c  =  ma x(/B  +  sB[(max(/.4,/p )  -  fA)/sA]-fs) 

DO  j  =  r {fv  ~  lA)/g\ 1  UPTO  [(/p  -  /. 4  +  np  -  1  )/ff, J 

DO  i  =  \(fs  -  /b  -  jx\sB)lg{\  UPTO  [(fs  -  Jb  ~  jx\sB  +  ns  -  1  )/52j 
i'  =  gzi  ~  fs  r  fB  +  j*l$B 
r  =  fB  +  JX\SB  +  ix2SBst>/g\ 
first  =  c  +  ((r  -  c)  mod  stride) 
lmf irst  =  (first  -  fs  -  i')ns/ss  +  i'  +  K 
output-local-slice(  lmf  irst,  lmlast  lmstride) 

*  ELSE  IF  5.4  >  0  AND  sB  <  0  THEN 

(*i,2/i,5i)  =  euclid(s,4,  sp) 

*  (*2  ,  52,52)  =  auclid( -sBsv/g\ ,  ->’5 ) 
stride  =  sB»V»S  KoxOl) 
lmstride  =  sBst>ns /(g\g2 ) 

*  last  =  max(  fB  +  sB  [( min(/.4 .  /p )  -  fA )/ *a]  .  fs ) 

lmlast  =  [(last  -  fs)/ss\ns  +  min((last  -  fs)  mod.s5.n5)  -f  K 

*  c  =  min(  fB  4-  *b  [( max(  fA.fv)~  Ja  •  Is ) 

DO  j  =  \(fv  ~  fA)/0\ 1  UPTO  [;/p  -  f A  +  np  -  1  )/5i J 

DO  i  =  [(fs  ~  /b  -  JX\SB  )/gi]  UPTO  [( fs  -  fe~  Jx  i  SB  +  »S  -  1  )/52j 
>'  =  52<  ~  fs  +  /«  +jx\sB 
r  =  fB  +  JX\sB  +  iX2*B*v/g\ 

*  f irst  =  c  —  ((c  —  r)  mod  (-stride)) 

lmf  irst  =  (first  -  fs  -  i')ns/ss  +  i'  +  K 
output-local-slice(  lmf  irst.  lmlast.  lmstride ) 

*  ELSE  IF  .s  4  <  0  AND  .sB  >  0  THEN 

*  (*l,5l-5i)  =  euclid(-.S4,.sp) 

(*2.52,52)  =  euclid(.sfi.sp/7,..s5) 
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stride  =  sBspss/(gig2) 
lmstride  =  sBspnsl(g\g2) 

*  last  =  +  sB[(max(lAJv)  -  /a)/sa\Js) 

lmlast  =  [(last  -  fs)/ss\ns  +  min((last  -  fs)  mod  55,  ns  -  1)  +  K 

*  c  =  max(/B  +  sB  f(min(/.4,/z>)  -  /a)/sa]Js) 

DO  j  =  ((fp  -  /a  )/g{\  UPTO  l(  fv  -fA  +  nv-  1  )/g\\ 

do  i  =  Ms  -  /b  -  jx\sB)lg{\  UPTO  [{fs  -  fB  ~  jxisB  +  ns  -  1  )/g2 J 
i'  =  52*  -  fs  +  /b  +  jx\sB 
r  =  fB  +  jxisB  +  ix2sBsp/gi 
first  =  c  +  ((r  -  c)  mod  stride) 
lmf irst  =  (first  -  fs  -  i')ns/ss  +  i'  +  K 
output-local-slice(  lmf  irst,  lmlast,  lmstride) 

*  ELSE  IF  s.4  <  0  AND  sB  <  0  THEN 

*  (zi,!/l,gt)  =  euclid(-sA,^v) 

*  (X2,V2,92)  =  euclid(-SBSD/ffi,S5) 
stride  =  sBsvss /(5152) 
lmstride  =  sBspns/(g\g2 ) 

*  last  =  max(/B  +  sB[(max(^,/p)  -  fA)/sA\,fs) 

lmlast  =  [(last  -  fs)/ss\ns  4-  min((last  -  fs)  mod  55,05)  +  K 

*  c  =  min(  fB  +  sBf(  min(  fAJp)-fA)/sA}Js) 

DO  j  =  \{fv  -  fA)/gi 1  UPTO  [(fp  -  f  \  +  np-  1  )/<7,J 

DO  i  =  ((fs  -  fB  ~  jxiSB)/g2]  UPTO  [(fs  -  fB  -  jx\sB  +  05-1  )/g2  J 
i'  =  g2i  -  fs  +  fB  +JX\SB 
r  =  fB  +jxisB  +  ix2sBsp/g\ 

*  f irst  =  c  -  ((c  -  r)  mod  (-stride)) 

lmf  irst  =  (first  -  fs  -  i')ns/ss  +  i'  +  K 
output-local-slice(  lmf  irst.  lmlast.  lmstride) 

ENDIF 

A.2  Extended  Local-Index-Receive  algorithm 

As  before,  there  are  four  nearly  identical  parts  to  this  algorithm,  and  differences  are  noted  by  asterisks. 

IF  .S.4  >  0  AND  sB  >  0  THEN 
{x\.y\.gi)  =  euclid(.s.4,sp) 

(X2, 92,92)  =  euclid(  sBsp/g\ ,  ss ) 

stride  =  sAspss/(g\g2) 

lmstride  =  -s.4n-p.s5 /( <71  <72 ) 

last  =  min(/.4,/p./.4  +  sA[(ls  -  /b)/sbJ) 

lmlast  =  [(last  -  fp)/sp\np  +  mini (last  -  fp)  mod  sp.  np  —  1 )  +  K 
c  =  max(  fA ,  fp,  fA  -f-  .s.4  \(fs  ~  fB  )/sb\  ) 

DO  ;  =  [(fp  -  f A )  /  <7 1 1  UPTO  [(fp  -  fA  +  np  -  1  )/fir,J 
t  =  (fA  +  jx\sA  -  fp )  mod  sp 

DO  i  =  ((  fs  ~  fB  ~  jx\SB)/92]  UPTO  [(fs  -  fB  ~  jx\*B  +  "S  ~  1  )/<7:J 
r  =  }a  +jx)SA  +  ix2sAsp/gt 
first  =  c  +  (( r  -  c)  mod  stride) 
lmf  irst  =  (first  -  fp  -  t)np/sp  +  t  +  K 
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input-local-slice(  lmf  irst,  lmlast,  lmstride) 

*  ELSE  IF  SA  >  0  AND  Sg  <  0  THEN 

(xi,2/i,5i)  =  euclid(5^,5z>) 

*  (X2, 52,52)  =  enclid(-sBsv/guss) 
stride  =  sAs-pss/(gi92 ) 
lmstride  =  sAnvss/(g\g2) 

*  last  =  min(lA,b,fA  +  sA[(fs  -  /b)/«bJ  ) 

lmlast  =  [(last  -  fv)/sp\np  +  min((last  -  /p)  mod  sp.np  -  1 )  +  K 

*  c  =  max(/4,/p,/A  +  si4r(/5-/B)/sBl) 

DO  j  =  T(/p  -  fA)/g{\  UPTO  [(/p  -  fA  +  Rp  -  l)/5lJ 
t  =  (fA  +  jx\sA  -  /p)  mod  sp 

DO  i  =  [(/s  -  fg  -  jx\sg)/g2]  UPTO  [(fs  -  /b  ~  jx\Sg  +  ns  -  1  )/<?2j 
r  =  fA  +jx[SA  +  ix2sAsv/g\ 
first  =  c  +  ((r  -  c)  mod  stride) 
lmf  irst  =  (first  -  /p  -  2)np/sp  +  i  +  K 
input-local-slice(  lmf  irst,  lmlast,  lmstride) 

*  ELSE  IF  S. 4  <  0  AND  sB  >  0  THEN 

*  (xt, 2/i,5l)  =  euclid( -s.4,sp) 

(X2, 52, 52)  =  eucl  id(  sBst> /g\,ss) 
stride  =  sAspss/(gig2 ) 
lmstride  =  sAnvss/(g\g2) 

*  last  =  max.(lA,fv,fA  +  sA[(ls  -  /s)/<SflJ) 

lmlast  =  [(last  -  /p)/spjnp  +  min((last  -  /p)  mod  sp./ip)  +  K 

*  c  =  min(fA,Ip,fA  +  sA\(fs  “  Jb )/5b]  ) 

DO  j  =  R/p  -  /.4)/5ll  UPTO  [(/p  -  /.4  +  «P  -  1  )/5lJ 
<  =  (//!+  jx,s.4  -  /p )  mod  sp 

DO  i  =  [(/s  -  /fl  -  jx|SB)/g2l  UPTO  [(/$  -  Jb  -  jx\Sg  +  ns  -  1  )/<72j 
r  =  fA+  jx{SA  +  ix2sAsp/g\ 
first  =  c  -  ((c  —  r)  mod  (-stride)) 
lmf  irst  =  (first  -  /p  -  <)np/.sp  +  /  +  K 
input-local-slice(  lmf  irst,  lmlast,  lmstride) 

*  ELSE  IF  5^  <  0  AND  sB  <  0  THEN 

*  (x  1, 2/1,51)  =  euclid(— s.4,sp) 

*  (X2, 52,52)  =  auclid(  -SB^v/giiSs ) 
stride  =  s^spss/^i^) 
lmstride  =  sAnpSsl(g\gz) 

*  last  =  max(/.4,/p,/,M  +-s.4[(/s  -  /b)/*bJ) 

lmlast  =  [(last  -  /p)/-spj  Rp  +  min((last  -  /p )  mod  -sp.  up )  -f  K 

*  c  =  min(/4,/p,/4  +  ^[(/.s  -  /b)/*b!) 

DO  j  =\{fv~  !a  )/g i]  UPTo  [(/p  -  fA  4-  Rp  -  1  )/5iJ 
t  =  (fA+  ]X\SA  -  fv)  mod  -sp 

DO  i  =  \(  fs  ~  /b  -  jx\» e )/52]  UPTO  ’(/s  -  /b  -  JX|.ss  +  05  -  1  )/<72J 
r  =  fA  +jxisA  +  ix2sAST>/g\ 

*  f  irst  =  c  -  ((c  -  r)  mod  ( -stride)) 

lmf  irst  =  (first  -  /p  -  t)np/sp  -f  /  -f  K 
input-local  -  .  1  ice(  lmf  irst,  lmlast.  lmstride) 
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ENDIF 


A3  Extended  Compute-Loop  algorithm 

In  this  algorithm,  there  are  two  similar  cases.  Differences  are  once  again  marked  by  an  asterisk. 

IF  sA  >  0  THEN 

(*l,yi>0l)  =  euclid  (s4,si>) 
stride  =  sAsp/g\ 

Imstride  =  sAnp/g\ 
last  =  min(/^,/x>) 

lmlast  =  [(last  -  fp)/sp\np  +  min((last  -  fp)  mod  sp.np  -  1)  +  K 
c  =  max{fA,fp) 

DO  j  =  \{fp  -  fA)/g il  UPTO  Kfp  -  fA  +  np  -  l  )/9l\ 

?  =  g\j  +  f a-  h 

r  =  fA  +jx\sA 

first  =  c  +  ((r  —  c)  mod  stride) 
lmf  irst  =  (first  -  fp  -  j')np/sp  +  j'  +  K 
DO  i  =  lmf irst  UPTO  lmlast  BY  Imstride 
A(i)  =  Jr(T ■(»)) 

*  ELSE  IF  SA  <  0  THEN 

*  (x,, ffi)  =  euclid(-sA, sp) 

*  stride  =  -sAsp/g\ 

*  Imstride  =  -sAnp/g  i 

*  last  =  mi  n(fA,lp) 

*  lmlast  =  [(last  -  fv)/sp\np  +  min((last  -  fp)  mod  sp.np  -  I )  +  K 

*  c  =  max(  l  A ,  fp ) 

DO  j  =  Kfp-  fA)/gx]  UPTO  [{fp  -  f A  +  np  -  \ )/ <7,J 
j'  =  g\j  +  Ja  -  fv 
r  =  fA  +  jx\SA 

first  =  c  +  ((r  —  c)  mod  stride) 

lmf  irst  =  (first  -  fp  -  j')np/sp  +  j1  +  K 

DO  i  =  lmfirst  UPTO  lmlast  BY  Imstride 

A(i)  =  JF(T(0) 

ENDIF 

B  Optimized  communication  for  block  and  cyclic  distributions 

When  the  distributions  of  one  or  both  arrays  are  known  to  be  block  or  cyclic,  rather  than  block-cyclic, 
our  send  and  receive  algorithms  can  be  optimized.  When  both  distributions  are  block-cyclic,  the  send  and 
receive  algorithms  contain  three  nested  loops:  the  j  loop,  the  i  loop,  and  the  slice  loop  (see  Figure  7  for 
an  example).  The  j  loop  is  due  to  the  left-hand  side  array  being  block-cyclic,  and  the  i  loop  is  due  to  the 
right-hand  side  array  being  block-cyclic. 

When  either  array  is  distributed  block  or  cyclic,  we  no  longer  need  its  corresponding  outer  loop  to 
characterize  the  intersections.  Although  some  optimizations  could  be  applied  by  substituting  constants  into 
the  equations,  we  will  still  always  have  the  three  nested  loops.  We  can  achieve  the  most  efficiency  by 
rederiving  the  equations  with  the  new  definitions  of  Ovmp{ A)  and  Owns (B). 
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Here  we  show  the  new  Local-Index-Send  and  Local-Index-Receive  algorithms. 

B.l  Case  A:  block-cyclic,  block 

Array  A  is  block-cyclic,  and  B  is  block.  The  ownership  sets  are  given  by: 

nx>  — 1 

Own-oik)  =  (J  Uo  +  }  :  h  '■  so) 
j'= o 

Owns{ B)  =  (f's  ■  fs  +  ns  -  1:1) 

The  revised  Local-Index-Send  algorithm  is  the  following: 

(*l>yi>5l)  =  euclid(s.4,  so) 
stride  =  sBso/g\ 
lmstride  =  sBsp/gi 

last  =  min(/B  +  saLCmin^/p)  -  /a)/«aJi/s  +  ns  -  1) 
lmlast  =  min(last  -  fs, ns  -  1)  4-  K 
c  =  max(/g  +  sB\(max(fA,fo)  ~  Ja)/sa]Js) 

DO  j  =  r  Uo-  fA)/g{\  UPTO  Kfo  -  fA  +  no-  l  )/gil 
r  =  fB  +jxiSB 

first  =  c  +  ((r  -  c)  mod  stride) 
lmf  irst  =  first  -  fs  +  K 

output-local-slice(  lmf  irst,  lmlast.  lmstride) 

The  revised  Local-Index-Receive  algorithm  is  the  following: 

(xi,yi,g\)  =  euclid(5.4.  so) 
stride  =  sAso/g\ 
lmstride  =  -sAno/g  i 

last  =  min(/.4,/p,  fA  +  sA  [(fs  +  ns  -  1  -  /b)/*bJ  ) 

lmlast  =  [(last  -  fp)/sp\np  +  min((last  -  fp)  mod  .s p. np  -  1 )  +  F 

c  =  max(  fA ,  fo  -  f a  +  *  A  \(fs  -  fB  )/sb]  ) 

DO  j  =  r  (fo-  fA  )/gi]  UPTO  [(fo  -  fA  +  np-  I  )/gt  J 
t  =  (f a  +  jx  i-s.4  -  fp)  mod  sp 

x  —  fA  +  jX]SA 

f irst  =  c-f-((r  —  c)  mod  stride ) 

lmf  irst  =  (first  -  fo  -  t)np/sp  +  t  +  K 

input-local-slice(  lmf  irst.  lmlast.  lmstride) 

B.2  Case  B:  block-cyclic,  cyclic 

Array  A  is  block-cyclic,  and  B  is  cyclic.  The  ownership  sets  are  given  by: 

n-p  —  1 

Ownoi A)  =  [J  (fo+j':lp-sp) 
y=o 


Owns) B)  =  ( fs  '  Is  ■  -“s ) 


The  revised  Local-Index-Send  algorithm  is  the  following: 

(*t,5i,5i)  =  euclid(s.4,sp)  /*  *  1  must  be  nonnegative  */ 

(*2,52,52)  =  euclid(sfi5t»/ffii«5) 

(*3,53,53)  =  9UClid(li5Siff2) 

stride  =  sbsvss  l{g\9i) 

lmstride  =  sbsv/{9\92 ) 

last  =  min(/s  -I- ss[(min(/4,/p)  -  /a)/3aJ,/s) 

lmlast  =  [(last  -  fs)/ss\  +  K 

c  =  ma x(/b  +  ss[(max(/4,/p)  -  /a)/sa1,/s) 

IF  (/b  -  /s)  mod  53  ==  0  THEN 
Cj  =  lifv  -  /a)/5i] 

do  j  =  Cj  +  (( fs  -  fB)xi/g3  -  Cj)  mod  53  UPTO  [(/p  -  /a  +  nv  ~  1  )/5iJ 
BY  52/53 

r  =  /b  +  ja:i5B  +  (/$  ~  /b  -  j*l-SB  )x2$BSV  /  (9\92) 
first  =  c  +  ((r  -  c)  mod  stride) 
lmf irst  =  (first  -  fs)/ss  +  K 

output-local-slice(  lmf  irst,  lmlast,  lmstride) 

The  revised  Local-Index-Receive  algorithm  is  the  following: 

(*i,2/i,5i)  =  euclid(s4,5p)  /*  *  1  must  be  nonnegative  */ 

(*2,2/2,52)  =  euclid(  sbsv /51 ,  S5 ) 

(*3,2/3,53)  =  euclid(iisB,52) 

stride  =  SASpSs/(gi92) 

lmstride  =  34712)35/(5152) 

last  =  min( IaJv-Ia  +  5a [(Is  ~  /b)/5bJ  ) 

lmlast  =  [(last  -  /p)/s pjnp  +  min((last  -  /p)  mod  5p.  np  -  1 )  +  K 
c  =  max(/4,  /2>,  /a  +  34  [(/s  -  /b)/sb]) 

IF  (/b  ~  /$)  mod  53  ==  0  THEN 
Cj  =  [(/p  “  /a  )/ 5ll 

DO  j  =  ^  +  ( ( fs  -  /b )* 3 / 53  -  C;)  mod  53  UPTO  [(/p  -  /a  +  »P  ~  >  )/5iJ 
BY  52/53 

r  =  /a  +  .?*iS4  +  {/s  -  /s  -  j*|SB)*2«A-sp/(5l52) 

t  =  (/a  +  j*  1 3.4  -  /p )  mod  sp 

first  =  c  +  ((r  -  c)  mod  stride) 

lmf  irst  =  (first  -  /p  -  /  )np[sp  +  /  4-  K 

input-local-slice(  lmf  irst.  lmlast.  lmstride) 


B3  Case  C:  block,  block-cyclic 

Array  A  is  block,  and  B  is  block-cyclic.  The  ownership  sets  are  given  by: 


Oump(A)  =  (fv  '  fv  +  nv  ~  1:1) 


n$-  I 

Owns(B)  =  (J  (fs  +  i'  ■  h  ■  35) 

i'=0 
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The  revised  Local-Index-Send  algorithm  is  the  following: 

(*2,02,02)  =  auclid(^B,^5) 
stride  =  sbAs/02 
lmstride  =  SBn$/g2 

last  =  min(/8  +  sB[(mm(lA,fv  +  nv  -  1)  -  fA)/sA\Js) 

lmlast  =  [(last  -  fs)/ss\ns  4-  min((last  -  fs)  mod  ss,ns  -  1)  +  K 

c  =  max(/s  +  3B[(max{fA,fv)  -  fA)/sA],fs) 

DO  i  =  f (fS  -  /s )/ 9l]  UPTO  l(fS  -fs  +  ns-  1  )/52J 
*'  =  92*  -fs  +  fB 
r  =  fB  +  *x2sB 

first  =  c  +  ((r  -  c)  mod  stride) 

lmf irst  =  (first  -  fs  -  *')ns/ss  +  i'  +  K 

output  -  local  -  s  1  ice(  lmf  irst ,  lmlast ,  lmstride ) 

The  revised  Local-Index-Receive  algorithm  is  the  following: 

(*2,02,02)  =  euclid(^B,^) 
stride  =  sAss/g2 
lmstride  =  sAss/g2 

last  =  min (lA,fv  +  nv  -  1,/a  +  sA[(ls  -  f b)/*b\  ) 
lmlast  =  min(last  -  /p,  nv  -  1 )  +  K 
c  =  ma\{fAJvJA  +sA\(fs  -  Ib)/sb ]) 

DO  i  —  [(/$  —  fB )/ 02]  UPTO  [{fs  -fB  +  ns-  1  )/02j 
r  =  fA  +  ix2sA 

first  =  c  +  ((r  -  c)  mod  stride) 
lmf  irst  =  first  -  fv  +  K 

input-local-slice(  lmf  irst,  lmlast,  lmstride) 

B.4  Case  D:  block,  block 

Array  A  is  block,  and  B  is  block.  The  ownership  sets  are  given  by: 

Ownv( A)  =  (fv  ■  fv  +  nv  -  1:1) 

Owns{ B)  =  (fs  -fs  +  ns-  1  :  1) 

The  revised  Local-Index-Send  algorithm  is  the  following: 

stride  =  »b 
lmstride  =  sb 

last  =  min(/B  +  sB  [(min(/.4,  /p  +  nv  -  1 )  -  fA  )/s.4J  .fs  +  ns  -  1 ) 
lmlast  =  min(last  -  fs,ns  -  1 )  +  K 
c  =  max(  fB  +  sB f( max(  fA ,/p)  -  fA )/.s.4] ,  fs ) 
r  =  /e 

first  =  c  +  ((r  -  c)  mod  stride) 
lmf  irst  =  first  -  fs  +  K 

output-local-slice(  lmf  irst,  lmlast,  lmstride) 
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The  revised  Local-Index-Receive  algorith  n  is  the  following: 


stride  =  sA 
lmstride  =  sa 

last  =  min (lA,fv  +  nv  -  \,fA  +  sA[{fs  +  ns  -  1  -  fB)/sB J) 
lmlast  =  min(last  -  /p,  np  -  1)  +  K 
c  =  max(fA ,fv,fA  +  sA\(fs  -  fB)/sB]) 

T  —  f A. 

first  =  c  +  ((r  —  c)  mod  stride) 
lmf irst  =  first  -  /p  +  K 

input-local-slice(  lmf  irst,  lmlast,  lmstride) 

B.5  Case  E:  block,  cyclic 

Array  A  is  block,  and  B  is  cyclic.  The  ownership  sets  are  given  by: 

Otunp(A)  =  (/p  :  /p  +  np  —  1  :  1) 
Owns(B)  =  ( fs  :  Is  ■  ss) 

The  revised  Local-Index-Send  algorithm  is  the  following: 

(*2,2/2,02)  =  euclid(se,s5) 
stride  =  sBss/g2 
lmstride  =  sB/g2 

last  =  min(/B  +  sB[(min(lA,  fo  +  nv  -  1)  -  fA)/sA\Js) 
lmlast  =  [(last  -  fs)/ss\  +  K 
c  =  max(  fB  -\-  sB  ("( max(  fA ,  fv )  -  fA  )/sA] ,  fs ) 

IF  (/a  -  fs)  mod  g2  ==  0  THEN 

r  =  fB  +  (fs  ~  fB  )X2SB /<72 
first  =  c  +  ((r  -  c)  mod  stride) 
lmf  irst  =  (first  -  fs)/ss  +  K 
output-local-slice(  lmf  irst,  lmlast.  lmstride) 

The  revised  Local-Index-Receive  algorithm  is  the  following: 

(*2,02,02)  =  SUClid(5B,35) 

stride  =  sA.$s/g2 
lmstride  =  sAss/g2 

last  =  min(/^,/p  +  rap  -  1 . fA  +  sA [(ls  -  fB)/sB J) 

lmlast  =  min(last  -  /p,  rap  -  l )  +  K 
c  =  max( fA,fv,  fA  +  sA\(fs  -  fB)/sB  1) 

IF  (fB  -  fs)  mod  02  ==  0  THEN 
r  =  f A  +  (fs  ~  /b)*2-5.4/02 
first  =  c  +  ((r  -  c)  mod  stride) 
lmf  irst  =  first  -  /p  +  K 

input-local-slice(  lmf  irst,  lmlast,  lmstride) 
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B.6  Case  F:  cyclic,  block-cyclic 

Array  A  is  cyclic,  and  B  is  block-cyclic.  The  ownership  sets  are  given  by: 


Ownp(k)  =  (fp  :  Ip  :  sp) 


ns- 1 

Owns(B)  =  (J  (fs  +  i' :  Is  ■  ss) 

i'= o 

The  revised  Local-Index-Send  algorithm  is  the  following: 

(x\,yi,g\)  =  euclid(sAi  sp) 

{xi,yi,gi)  =  euclid(sBsp/gt,ss) 

stride  =  sBspss/(gig2) 

lmstride  =  sBspns/(gyg2) 

last  =  min (fB  +  sB[(min(lA,lp)  -  /a)/»a\Js) 

lmlast  =  [(last  -  fs)/ss\ns  +  min((last  -  fs)  mod  55,715  -  1 )  +  K 

c  =  max(/B  +  sB\(max(fA,fp)  -  fA)/sA],fs) 

IF  Ua  ~  fv )  mod  g{  ==  0  THEN 

DO  i  =  f (fs  -  fB  -  (fv  -  fA)x\sBlg\)lg{\  UPTO 
[(fs  -  fB-(fv-  fA)X\SB/g\  +715-1  )!gi\ 
i'  =  <72*  -  fs  +  fB  +  (fv  -  fA)x\SB/g[ 
r  =  fB  +  (fv  -  fA)x\sBlg\  +  ix2sBsp/gi 
first  =  c  +  ((r  —  c)  mod  stride) 
lmf irst  =  (first  -  fs  -  i')ns/ss  +  i1  +  K 
output-  local  -  s  1  ice(  lmf  irst,  lmlast ,  lmstride ) 

The  revised  Local-Index-Receive  algorithm  is  the  following: 

(x\,V\,g\)  =  euclid(s4,si>) 

(X2,y2,g2)  =  QUcIid(sBsp/g\,ss) 

stride  =  sAspss/(g\g2) 

lmstride  =  sAss/(g\g2) 

last  =  min(lA,lp,fA  +  sA[(ls  -  /b)/sbJ) 

lmlast  =  [(last  -  fp)/sp J  +  K 

c  =  max(fA,fp,fA  +  sA[(fs  -  fB )/sB] ) 

IF  (/a  -  fp)  mod  g\  ==  0  THEN 

DO  i  =  r(/5  -  fB  ~  (fv  ~  fA)x\sB/g\)/g2]  UPTO 
[(fs  ~  fB  ~  (fv  ~  fA)x\sB/g\  +  775-1 )/ g2\ 
r  =  f a  +  ( fv  ~  fA)x\sA/g\  +  ix2»Asp/g\ 
first  =  c  +  ((r  -  c)  mod  stride) 
lmf  irst  =  (first  -  fv)/sp  +  K 
input-local-slice(  lmf  irst.  lmlast.  lmstride) 

B.7  Case  G:  cyclic,  block 

Array  A  is  cyclic,  and  B  is  block.  The  ownership  sets  are  given  by: 
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Ownx >(A)  =  ( ft>  :  Zp  :  sp) 

Owns(B)  =  (fs  '■  fs  +  ns  -  1:1) 

The  revised  Local-Index-Send  algorithm  is  the  following: 

(*l>yi»0l)  =  euclid(54,^zj) 
stride  = SB$v/g\ 
lmstride  =  sBsiy/g  i 

last  =  min(/s  +  sB[(min{lA,b)  -  fA)/sA\,fs  +  ns  -  1) 
lmlast  =  min(last  -  fs,  ns  -  1 )  +  K 
c  =  max(/a  +  sB\(max(fA,  fo)  -  fA)/sA],fs) 

IF  ( f A  -  /p)  mod  gi  ==  0  THEN 
r  =  /b  +  (fv  -  fA)xisB/g\ 
first  =  c  +  ((r  —  c)  mod  stride) 
lmf irst  =  first  -  fs  +  K 

output-  local  -  si  ice(  lmf  irst,  lmlast ,  lmstride ) 

The  revised  Local-Index-Receive  algorithm  is  the  following: 

(x\,y\,9i)  =  euclid(34,sp) 
stride  =  sAst>/g  i 
lmstride  =  sA/g\ 

last  =  mm(lA,h,fA  +  sA[(fs  +  ns  ~  1  -  fB)/sB J ) 
lmlast  =  [(last  -  /p)/sp J  +  K 
c  =  max(  fAJv,fA+sA\(fs-  fB )/ ) 

IF  {fA  ~  fv)  mod  g\  ==  0  THEN 
r  =  fA  +  (fv~  fA)xisA/g\ 
first  =  c  +  ((r  -  c)  mod  stride) 
lmf  irst  =  (first  -  fv)hv  +K 
input-local-slice(  lmf  irst,  lmlast,  lmstride) 

B.8  Case  H:  cyclic,  cyclic 

Array  A  is  cyclic,  and  B  is  cyclic.  The  ownership  sets  are  given  by: 


Ownx>( A)  =  (fy  :  Zp  :  .?p) 
Owns( B)  =  (fs  '■  Is  '■  *s) 

The  revised  Local-Index-Send  algorithm  is  the  following: 


45 


(xi,Vi,gi)  =  auclid(s^,5-p) 

(X2,V2,92)  =  euclid(sBsp/<7i,s.s) 

stride  =  sBsvss/(g\g2 ) 

lmstride  =  sBs-p/(g[g2) 

last  =  min(/B  +sB[(min(/A,/p)  -  /a)/^a\Js) 

lmlast  =  [(last  -  fs)fss\  +  K 

c  =  max(/B  +  sB\{max(fA,fv)  -  f a)/*a]Js ) 

IF  (/ A  -  fv)  mod  5!  ==  0  AND  (fB  +  (  ft,  -  fA)xisB/gi  -  fs)  mod  52  ==  0  THEN 
<  =  (/©-  /a)x\sB/9i 
r  =  /b  + 1  +  (fs  -  /b  -  t)xisBsv/(g\g2) 
first  =  c  +  ((r  -  c)  mod  stride) 
lmf irst  =  (first  -  fs)/$s  +  K 

output-local-slice(  lmf  irst,  lmlast,  lmstride) 

The  revised  Local-Index-Receive  algorithm  is  the  following: 

{xi,Vi,g\)  =  euclid(s.4,si>) 

(t2,y2,02)  =  euclid(sBst)/gi,S5) 

stride  =  sAsvss/(g\g2) 

lmstride  =  sAss/(gig2) 

last  =  min (lA,lv,fA  +  sA[(ls  -  fB)l»B\) 

lmlast  =  [(last  -  fv)/sv J  +  K 

c  =  max(/4  JvJa  +  sA\{fs  -  fB  )/sB  ] ) 

IF  (/a  -  fv)  mod  g\  ==  0  AND  ( fB  +  (fv  -  fA)x\sB/g{  -  fs)  mod  52  ==  0  THEN 

t  =  (fv  ~  fA)x\SB/g\ 

r  =  fA  +  (fv  -  fA)x\sA/g\  +  (fs  -  fB  -  t)x2SAST>/(gigz) 
first  =  c  +  ((r  -  c)  mod  stride) 
lmf  irst  =  (first  -  /p)/sp  +  K 
input-local-slice(  lmf  irst,  lmlast.  lmstride) 


C  Optimized  computation  for  block  and  cyclic  distributions 

As  in  Appendix  B,  we  can  use  an  optimized  version  of  the  Compute-Loop  algorithm  when  the  distribution 
of  the  array  on  the  left-hand  side  of  the  assignment  statement  is  known  to  be  block  or  cyclic.  Since  the 
distribution  of  only  the  left-hand  side  is  relevant,  we  need  to  derive  only  two  additional  algorithms,  rather 
than  eight. 

C.l  Case  A:  block 

Array  A  has  a  block  distribution.  The  ownership  set  is  given  by: 

Ownt>(k )  =  (fv  ;  fv  +  nv  —  1  :  1 ) 

The  revised  COMPUTE-LOOP  algorithm  is  the  following: 
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stride  =  54 

lmstride  =  54 

last  =  min (U,fv  +  np  -  1) 

lml ast  =  last  -  fv  +  K 

c  =  ma x(fA,fv) 

r  —  f A 

first  =  c  +  ((r  -  c)  mod  stride) 

lmf irst  =  first  -  fv  +  K 

DO  i  =  lmfirst  UPTO  lmlast  BY  lmstride 

A  (*)  =  ^(T(t)) 

C.2  Case  B:  cyclic 

Array  A  has  a  cyclic  distribution.  The  ownership  set  is  given  by: 

Oump(A)  =  (fv  :  lv  ’■  $v) 

The  revised  COMPUTE-LOOP  algorithm  is  the  following: 

(xuyi,gi)  =  euclid(54,5p) 

IF  ( fA  -  fv)  mod  gi  ==  0  THEN 
stride  =  sAsv/g  1 
lmstride  =  sA/g 1 
last  =  min(lAJv) 
lmlast  =  [(last  -  fv)/sv\  +  K 
c  =  max(  fAifv) 
r=fA+  (fv  -  fA)x\sA/gi 
first  =  c  +  ((>  -  c)  mod  stride) 
lmfirst  =  [(first  -  /p)/spj  +  K 
DO  i  =  lmfirst  UPTO  lmlast  BY  lmstride 
A(i)  =  jF(T(i)) 


D  Notation 

There  are  many  variables  and  conventions  used  throughout  the  paper.  In  this  appendix,  we  provide  a 
description  of  many  of  these  variables. 

$ :  The  set  of  global  array  elements  to  be  sent  from  one  processor  to  another. 

A:  The  array  on  the  left-hand  side  of  the  assignment  statement. 

B:  The  array  on  the  right-hand  side  of  the  assignment  statement. 

6,:  The  alignment  offset  for  an  array  dimension.  The  subscript  refers  to  either  a  particular  dimension  of  a 
template  or  a  particular  one-dimensional  array. 

c,:  The  lower  index  bound  of  a  template  dimension  with  which  a  particular  array  dimension  is  aligned.  The 
subscript  refers  to  either  a  particular  dimension  of  a  template  or  a  particular  one-dimensional  array. 


47 


T:  An  arbitrary  element-wise  intrinsic  function,  such  as  sin,  log,  or  the  identity. 

/:  The  first  component  of  a  slice,  as  in  (/  :  l  :  s). 

/a’-  The  first  component  of  the  slice  on  the  left-hand  side  of  the  array  assignment  statement,  as  in  A(  fA  : 
l a  :  -s.4)  =  ^(B (Jb  ■  Ib  ■  *b))- 

Zb-  The  first  component  of  the  slice  on  the  right-hand  side  of  the  array  assignment  statement,  as  in 
A(/U  '■  l a  :  5.4 )  =  ^r(B(/s  :  Ib  ■  sb)). 

fp :  The  index  of  the  first  element  of  the  left-hand  side  array  owned  by  receiving  (destination)  processor  V. 

fs :  The  index  of  the  first  element  of  the  right-hand  side  array  owned  by  sending  processor  S. 

g :  The  greatest  common  divisor  (GCD)  of  some  pair  of  input  integers. 

K:  The  first  index  of  a  particular  array.  In  the  C  programming  language,  K  is  always  0;  in  Fortran  K  is 
usually  1. 

/:  The  last  component  of  a  slice,  as  in  (/  :  /  :  s). 

Ia-  The  last  component  of  the  slice  on  the  left-hand  side  of  the  array  assignment  statement,  as  in  A(  : 

iA-sA)  =  mfB-.iB-sB)). 

lB •  The  last  component  of  the  slice  on  the  right-hand  side  of  the  array  assignment  statement,  as  in  A(  /,t  : 
l. 4  ■  sa)  =  ^r(B(/s  :  lB  '■  5fl))- 

Ip:  The  index  of  the  last  element  of  the  left-hand  side  array  owned  by  receiving  (destination)  processor  P. 

Is’-  The  index  of  the  last  element  of  the  right-hand  side  array  owned  by  sending  processor  5. 

LM:  A  function  that  maps  a  global  array  index  into  an  index  of  a  local  array  on  a  processor.  L\l  is  defined 
in  equation  (16). 

Map:  A  function  that  maps  a  global  index  of  an  array  into  an  index  of  a  per-processor  local  array.  Map  is 
defined  in  equation  (5). 

np:  The  block  size  of  the  distribution  of  the  left-hand  side  array. 

ns:  The  block  size  of  the  distribution  of  the  right-hand  side  array. 

n':  The  block  size  of  the  template  with  which  the  array  is  aligned.  This  parameter  differs  from  the  block 
size,  n,  of  the  array  when  oL  oR.  The  identity  n'  =  n  4-  <>L  -  olt  always  holds. 

oL:  The  alignment  overlap  on  the  left.  This  value  is  0  except  in  the  case  of  partially  replicated  distributions 
(see  Section  10.4). 

oR:  The  alignment  overlap  on  the  right.  This  value  is  0  except  in  the  case  of  partially  replicated  distributions 
(see  Section  10.4). 

Ownp(A):  The  set  of  indices  corresponding  to  the  elements  of  left-hand  side  array  A  owned  by  receiv  ing 
(destination)  processor  T>.  This  set  can  be  represented  using  the  four  parameters  /p ,  / p .  ' p .  and  nP. 
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O  wns(B):  The  set  of  indices  corresponding  to  the  elements  of  right-hand  side  array  B  owned  by  sending 
processor  S.  This  set  can  be  represented  using  the  four  parameters  fs,  Is .  and  ns- 

pi  A  processor  ID  expressed  as  a  vector,  with  one  component  for  each  dimension  of  the  array,  where 
0  <  p  <  v.  Two  different  processors  will  differ  in  at  least  one  component  of  their  corresponding  p 
vectors. 

'Rj  j  i  A  function  that  returns  a  representative  of  an  input  set.  The  subscript  of  7c  is  an  index  list.  For  each 
valid  instantiation  of  the  indices  in  the  index  list,  there  is  a  different  representative.  Associated  with 
1Z  is  a  lower  bound,  an  upper  bound,  and  a  stride,  each  of  which  is  independent  of  the  indices  in  the 
index  list.  Each  representative,  along  with  the  lower  bound,  upper  bound,  and  stride,  defines  a  slice, 
and  the  union  of  the  slices  over  the  valid  instantiations  of  the  index  list  results  in  the  input  set. 

r:  A  representative  element  of  an  infinite  slice,  as  described  in  Section  5. 

s:  The  stride  component  of  a  slice,  as  in  ( /  :  /  :  s). 

s,\:  The  stride  component  of  the  slice  on  the  left-hand  side  of  the  array  assignment  statement,  as  in 
A  (/a  ■  U  ’■  sa)  =  -^(3  (Ib  :  Ib  ■  sb))- 

sb:  The  stride  component  of  the  slice  on  the  right-hand  side  of  the  array  assignment  statement,  as  in 
A(/.4  :  U  ■  sa)  =  :  Ib  ■  sb))- 

sp:  The  stride  of  the  distribution  of  the  left-hand  side  array.  The  distribution  stride  is  defined  as  the 
difference  between  the  starting  points  of  two  consecutive  blocks  owned  by  a  processor. 

■ss’.  The  stride  of  the  distribution  of  the  right-hand  side  array.  The  distribution  stride  is  defined  as  the 
difference  between  the  starting  points  of  two  consecutive  blocks  owned  by  a  processor. 

v:  A  vector  giving  the  number  of  processors  assigned  to  each  corresponding  dimension  of  an  array. 

x:  An  integer  that  satisfies  ax  +  by  =  gcd (a.b)  for  nonnegative  integers  a  and  6,  for  an  integer  y.  Given  a 
and  6,  the  integers  x,  y,  and  gcd(a,  b)  can  be  found  by  using  Euclid’s  extended  GCD  algorithm. 

y:  An  integer  that  satisfies  ax  +  by  =  gcd(a.h)  for  nonnegative  integers  a  and  b,  for  an  integer  .r.  Given  a 
and  6,  the  integers  x,  y,  and  gcd(a,6)  can  be  found  by  using  Euclid’s  extended  GCD  algorithm. 


4 


49 


REPORT  DOCUMENTATION  PAGE 


OM*  No.  0704-019$ 


» tm tafattntf  imihwmw  *— m— w  iinnn  I  nowraf  w—w.  mdnAHniMuiwtacmiywnn MncMnuMdi m«  wrtg amwai 
NMl  WWMf  ttV  (SNKMH  Ol  MomMloK-  Iwi  COMRfffl  ffIMMf  IMl  btrttfl  tVCtMMCff  Of  I^T  OfO®r  KBM  Of  tMt 

■UK—  Wr foAodf dm ixw,  to  «mddod*oo hmoww i<rwc»i otucww far  Mf  ofmonoo  Ourmew id  Mpom.  Hi*  loHomo 
V*  HMMU  idtotl  Ofdct  of  Mwmiiwi  id  IX|H.  >mn<W  OodwUon  Won  |»70«  *Oi  WnMnpen.  OC  10MJ. 


1 1.  AGENCY  USi  ONLY  (UM  Man*)  1 2.  REPORT  OATE 


I  3.  IMPORT  TYPE  AMO  OATES  COVERED 


4.  nTU  AND  SUBTITU 

Efficient  Compilation  of  Array  Statements 
for  Private  Memory  Multicomputers 


I  S.  FUNDING  NUMBERS 


MDA972- 90-C-C035 


ft.  AUTHORS) 

James  M.  Stichnoth 


7.  PERFOAMMG  ORGANIZATION  NAME(S)  AND  ADORE  SS<ES) 

Carnegie  Mellon  University 


ft.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 

CMU-CS-93-1 09 


I  9.  SPONSORING /MONITORING  AGENCY  NAME(S)  ANO  ADDRESSEES) 


10.  SPONSORING  /  MONITORING 
AGENCY  REPORT  NUMBER 


11.  SUPPLEMENTARY  NOTES 


1 12*.  DISTRIBUTION /AVAILABItITY  STATEMENT 


12b.  DISTRIBUTION  COOF 


unlimited 


1 13.  ABSTRACT  {Maximum  200  wofOS) 


One  of  the  core  constructs  of  High  Performance  Fortran  (HPF) 
(see  title  page) 


1 14.  SUBJECT  TERMS 


IS.  NUMBER  OF  PAGES 


1ft.  PRICE  CODE 

17.  SECURITY  CLASSNICATION  I  IB.  SECURITY  CLASSIFICATION  I  It.  SECURITY  CLASSIFICATION  To!  LIMITATION  OF  ABSTRACT 

OP  REPORT  I  OP  THIS  PAGE  OP  ABSTRACT 


NSN  7540-01-200-5500 


Standard  form  298  (Rtv  2-89) 
»nicfim  *  ariv  wi 

m- 102 


53-85 


School  of  Computer  Science 
Carnegie  Mellon  University 
Pittsburgh,  PA  15213-3890 


